---
layout: post
title: "Stochastic Gradient Descent in Continuous Time"
date: 2023-09-28
---
Justin Sirignano and Konstantinos Spiliopoulos
October 31, 2017
Abstract
Stochastic gradient descent in continuous time (SGDCT) provides a computationally efficient method for the statistical learning of continuous-time models, which are widely used in science, engineering, and finance. The SGDCT algorithm follows a (noisy) descent direction along a continuous stream of data. SGDCT performs an online parameter update in continuous time, with the parameter updates $\theta_t$ satisfying a stochastic differential equation. We prove that $\lim_{t \rightarrow \infty} \nabla \bar{g}(\theta_t) = 0$, where $\bar{g}$ is a natural objective function for the estimation of the continuous-time dynamics. The convergence proof leverages ergodicity by using an appropriate Poisson equation to help describe the evolution of the parameters for large times. For certain continuous-time problems, SGDCT has some promising advantages compared to a traditional stochastic gradient descent algorithm. This paper mainly focuses on applications in finance, such as model estimation for stocks, bonds, interest rates, and financial derivatives. SGDCT can also be used for the optimization of high-dimensional continuous-time models, such as American options. As an example application, SGDCT is combined with a deep neural network to price high-dimensional American options (up to 100 dimensions).
1. Introduction
This paper develops a statistical learning algorithm for continuous-time models, which are common in science, engineering, and finance. We study its theoretical convergence properties as well as its computational performance in a number of benchmark problems. Although the method is broadly applicable, this paper mainly focuses on applications in finance. Given a continuous stream of data, stochastic gradient descent in continuous time (SGDCT) can estimate unknown parameters or functions in stochastic differential equation (SDE) models for stocks, bonds, interest rates, and financial derivatives. The statistical learning algorithm can also be used for the optimization of high-dimensional continuous-time models, such as American options. High-dimensional American options have been a longstanding computational challenge in finance. SGDCT is able to accurately solve American options even in 100 dimensions.
Batch optimization for the statistical estimation of continuous-time models can be impractical for large datasets where observations occur over a long period of time. Batch optimization takes a sequence of descent steps, each computed from the model error over the entire observed data path. Because every descent step requires a pass over the full path, batch optimization is slow (sometimes impractically slow) for long time horizons or for models which are computationally costly to evaluate (e.g., partial differential equations). Typical existing approaches in the financial statistics literature use batch optimization.
SGDCT provides a computationally efficient method for statistical learning over long time periods and for complex models. SGDCT continuously follows a (noisy) descent direction along the path of the observation; this results in much more rapid convergence. Parameters are updated online in continuous time, with the parameter updates $\theta_t$ satisfying a stochastic differential equation. We prove that $\lim_{t \rightarrow \infty} \nabla \bar{g}(\theta_t) = 0$, where $\bar{g}$ is a natural objective function for the estimation of the continuous-time dynamics.
Consider a diffusion $X_t \in \mathcal{X} = \mathbb{R}^m$:

$$d X_t = f^*(X_t)\, d t + \sigma\, d W_t. \tag{1.1}$$

The goal is to statistically estimate a model $f(x, \theta)$ for $f^*(x)$, where $\theta \in \mathbb{R}^n$. The function $f^*(x)$ is unknown. $W_t \in \mathbb{R}^m$ is a standard Brownian motion, and the diffusion term $\sigma\, d W_t$ represents any random behavior of the system or environment. The functions $f(x, \theta)$ and $f^*(x)$ may be non-convex.
The stochastic gradient descent update in continuous time follows the SDE:
$$d \theta_t = \alpha_t\left[\nabla_\theta f(X_t, \theta_t)\,(\sigma \sigma^\top)^{-1}\, d X_t - \nabla_\theta f(X_t, \theta_t)\,(\sigma \sigma^\top)^{-1} f(X_t, \theta_t)\, d t\right], \tag{1.2}$$
where $\nabla_\theta f(X_t, \theta_t)$ is matrix valued and $\alpha_t$ is the learning rate. The parameter update (1.2) can be used both for statistical estimation given previously observed data and for online learning (i.e., statistical estimation in real time as data becomes available). SGDCT will still converge if $\sigma \sigma^\top$ in (1.2) is replaced by the identity matrix $I$.
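To make the update concrete, here is a minimal simulation sketch for a scalar example, assuming a hypothetical true drift $f^*(x) = -\theta^* x$ (an Ornstein-Uhlenbeck process) and the parametric model $f(x,\theta) = -\theta x$, so that $\nabla_\theta f(x,\theta) = -x$. All numerical constants below (true parameter, learning-rate constants, step size) are illustrative assumptions, not the paper's experiments:

```python
import numpy as np

# Illustrative scalar SGDCT: true drift f*(x) = -theta_star * x (OU process),
# model f(x, theta) = -theta * x, so grad_theta f(x, theta) = -x.
rng = np.random.default_rng(0)
theta_star, sigma = 2.0, 0.5
dt, n_steps = 2e-3, 1_000_000          # simulate up to time T = 2000
A, C = 8.0, 10.0                       # learning rate alpha_t = A / (C + t)

dW = rng.normal(0.0, np.sqrt(dt), n_steps)
x, theta, t = 0.0, 0.0, 0.0
for i in range(n_steps):
    dX = -theta_star * x * dt + sigma * dW[i]   # observed increment of X
    alpha = A / (C + t)
    grad = -x                                   # grad_theta f(x, theta)
    # Euler discretization of the SGDCT update (1.2):
    #   d theta = alpha * grad_theta f * (sigma^2)^{-1} * (dX - f(x, theta) dt)
    theta += alpha * grad / sigma**2 * (dX - (-theta * x) * dt)
    x += dX
    t += dt

print(theta)  # lands close to theta_star = 2.0
```

As the text notes, replacing $\sigma\sigma^\top$ by the identity in the update leaves convergence intact; in this scalar sketch that simply rescales the effective learning rate.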
Using the proposed approach of this paper, the stochastic gradient descent algorithm (1.2) can also be generalized to the case where $\sigma$ is a variable coefficient $\sigma^*(X_t)$. In that case, a model $\sigma(x, \nu)$ is also learned for $\sigma^*(x)$, where $\nu \in \mathbb{R}^k$ is an additional set of parameters. (To be more precise, $\sigma(x, \nu) \sigma^\top(x, \nu)$ is learned for $\sigma^*(x) \sigma^{*,\top}(x)$, since $\sigma^*(x)$ itself is not identifiable.) See Section 4 for details and the corresponding convergence proof.
We assume that $X_t$ is sufficiently ergodic (to be concretely specified later in the paper) and that it has some well-behaved $\pi(d x)$ as its unique invariant measure. As a general notation, if $h(x, \theta)$ is a generic $L^1(\pi)$ function, then we define its average over $\pi(d x)$ to be

$$\bar{h}(\theta) = \int_{\mathcal{X}} h(x, \theta)\, \pi(d x).$$
The gradient $\nabla_\theta g(X_t, \theta)$ cannot be evaluated since $f^*(x)$ is unknown. However, $d X_t = f^*(X_t)\, d t + \sigma\, d W_t$ is a noisy estimate of $f^*(X_t)\, d t$, which leads to the algorithm (1.2). SGDCT follows a noisy descent direction along a continuous stream of data produced by $X_t$.
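To spell out the heuristic, one natural pointwise objective consistent with the update (1.2) is the $(\sigma\sigma^\top)^{-1}$-weighted squared model error (written here as an illustrative assumption; the paper's precise definition of $g$ is given with its own notation):

```latex
g(x, \theta) = \tfrac{1}{2}\,\big(f(x,\theta) - f^{*}(x)\big)^{\top} (\sigma \sigma^{\top})^{-1} \big(f(x,\theta) - f^{*}(x)\big),
\qquad
\nabla_{\theta} g(x, \theta) = \nabla_{\theta} f(x,\theta)\,(\sigma \sigma^{\top})^{-1} \big(f(x,\theta) - f^{*}(x)\big).
```

Substituting the noisy estimate $f^*(X_t)\, d t \approx d X_t$ into $-\nabla_\theta g\, d t$ recovers exactly the descent direction in (1.2), up to the martingale term $\nabla_\theta f\,(\sigma\sigma^\top)^{-1}\sigma\, d W_t$.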
Heuristically, it is expected that $\theta_t$ will tend towards the minimum of the function $\bar{g}(\theta) = \int_{\mathcal{X}} g(x, \theta)\, \pi(d x)$. The data $X_t$ will be correlated over time, which complicates the mathematical analysis. This differs from the standard discrete-time version of stochastic gradient descent, where the data is usually considered to be i.i.d. at every step.
1.1. Literature Review
In this paper we show that if $\alpha_t$ is appropriately chosen then $\nabla \bar{g}(\theta_t) \rightarrow 0$ as $t \rightarrow \infty$ with probability 1 (see Theorem 2.4). Results like this have been previously derived for stochastic gradient descent in discrete time; see [9] and [8]. [9] proves convergence in the absence of the $X$ term. [8] proves convergence of stochastic gradient descent in discrete time with the $X$ process, but requires stronger conditions than [9].
Although stochastic gradient descent for discrete time has been extensively studied, stochastic gradient descent in continuous time has received relatively little attention. We refer readers to [8, 16] and [9] for a thorough review of the very large literature on stochastic gradient descent. There are also many algorithms which modify traditional stochastic gradient descent (stochastic gradient descent with momentum, Adagrad, RMSprop, etc.). For a review of these variants of stochastic gradient descent, see [15]. We mention below the prior work which is most relevant to our paper.
Our approach and assumptions required for convergence are most similar to [9], who prove convergence of discrete-time stochastic gradient descent in the absence of the $X$ process. The presence of the $X$ process is essential for considering a wide range of problems in continuous time, and showing convergence with its presence is considerably more difficult. The $X$ term introduces correlation across times, and this correlation does not disappear as time tends to infinity. This makes it challenging to prove convergence in the continuous-time case. In order to prove convergence, we use an appropriate Poisson equation associated with $X$ to describe the evolution of the parameters for large times.
[21] proves, in a setting different than ours, convergence in $L^2$ of projected stochastic gradient descent in discrete time for convex functions. In projected gradient descent, the parameters are projected back into an a priori chosen compact set. Therefore, the algorithm cannot hope to reach the minimum if the minimum is located outside of the chosen compact set. Of course, the compact set can be chosen to be very large for practical purposes. Our paper considers unconstrained stochastic gradient descent in continuous time and proves the almost sure convergence $\nabla \bar{g}(\theta_t) \rightarrow 0$ as $t \rightarrow \infty$, taking into account the $X$ component as well. We do not assume any stability conditions on $X$ (except that it is ergodic with a unique invariant measure).
Another approach for proving convergence of discrete-time stochastic gradient descent is to show that the algorithm converges to the solution of an ODE which itself converges to a limiting point. This is the approach of [8]; see also [16]. This method, sometimes called the "ODE method", requires the assumption that the iterates (i.e., the model parameters which are being learned) remain in a bounded set with probability one. It is unclear whether the ODE method of proof can be successfully used to show convergence for a continuous-time stochastic gradient descent scheme. In this paper we follow a potentially more straightforward method of proof by analyzing the speed of convergence to equilibrium with an appropriately chosen Poisson type of equation.
[25] studies continuous-time stochastic mirror descent in a setting different than ours. In the framework of [25], the objective function is known. In this paper, we consider the statistical estimation of the unknown dynamics of a random process (i.e., the $X$ process satisfying (1.1)).
Statisticians and financial engineers have actively studied parameter estimation of SDEs, although typically not with statistical learning or machine learning approaches. The likelihood function will usually be calculated from the entire observed path of XX (i.e., batch optimization) and then maximized to find the maximum likelihood estimator (MLE). Unlike in this paper, the actual optimization procedure to maximize the likelihood function is often not analyzed.
Some relevant publications in the financial statistics literature include [1, 2], [7], and [14]. [7] derives the likelihood function for continuously observed $X$; the MLE can then be calculated via batch optimization. [1] and [2] consider the case where $X$ is discretely observed and calculate MLEs via a batch optimization approach. [14] estimates parameters by a Bayesian approach. Readers are referred to [10, 17, 26] for thorough reviews of classical statistical inference methods for stochastic differential equations.
1.2. Applications of SGDCT
Continuous-time models are especially common in finance. Given a continuous stream of data, the stochastic gradient descent algorithm can be used to estimate unknown parameters or functions in SDE models for stocks, bonds, interest rates, and financial derivatives. Numerical analysis of SGDCT for two common financial models is included in Sections 5.1, 5.2, and 5.5. The first is the well-known Ornstein-Uhlenbeck (OU) process (for examples in finance, see [19], [20], [29], and [18]). The second is the multidimensional CIR process, which is a common model for interest rates (for examples in finance, see [4], [22], [11], [5], and [12]).
Scientific and engineering models are also typically formulated in continuous time. There are often coefficients or functions in these models which are uncertain or unknown; stochastic gradient descent can be used to learn these model parameters from data. In Section 5, we study the numerical performance for two example applications: Burgers' equation and the classic reinforcement learning problem of balancing a pole on a moving cart. Burgers' equation is a widely used nonlinear partial differential equation which is important to fluid mechanics, acoustics, and aerodynamics.
A natural question is why use SGDCT versus a straightforward approach which (1) discretizes the continuous-time dynamics and then (2) applies traditional stochastic gradient descent. For some of the same reasons that scientific models have been largely developed in continuous time, it can be advantageous to develop continuous-time statistical learning for continuous-time models.
SGDCT allows for the application of numerical schemes of choice to the theoretically correct statistical learning equation for continuous-time models. This can lead to more accurate and more computationally efficient parameter updates. Numerical schemes are always applied to continuous-time dynamics, and different numerical schemes may have different properties for different continuous-time models. Performing a discretization of the system dynamics a priori and then applying a traditional discrete-time stochastic gradient descent scheme can result in a loss of accuracy. For example, there is no guarantee that (1) using a higher-order accurate scheme to discretize the system dynamics and then (2) applying traditional stochastic gradient descent will produce a statistical learning scheme which is higher-order accurate in time. Hence, it makes sense to first develop the continuous-time statistical learning equation, and then apply the higher-order accurate numerical scheme.
Besides model estimation, SGDCT can be used to solve continuous-time optimization problems, such as American options. We combine SGDCT with a deep neural network to solve American options in up to 100 dimensions (see Section 6). An alternative approach would be to discretize the dynamics and then use the Q-learning algorithm (traditional stochastic gradient descent applied to an approximation of the discrete HJB equation). However, Q-learning is biased while SGDCT is unbiased. Furthermore, in SDE models with Brownian motions, the Q-learning algorithm can blow up as the time step size $\Delta$ becomes small; see Section 6 for details.
The convergence issue with Q-learning highlights the importance of studying continuous-time algorithms for continuous-time models. It is of interest to show that (1) a discrete-time scheme converges to an appropriate continuous-time scheme as $\Delta \rightarrow 0$ and (2) the continuous-time scheme converges to the correct estimate as $t \rightarrow \infty$. These are important questions since any discrete scheme for a continuous-time model incurs some error proportional to $\Delta$, and therefore $\Delta$ must be decreased to reduce error. It is also important to note that in some cases, such as Q-learning, computationally expensive terms in the discrete algorithm (such as expectations over high-dimensional spaces) may become much simpler expressions in the continuous-time scheme (differential operators).
1.3. Organization of Paper
The paper is organized into five main sections. Section 2 presents the assumptions and the main theorem. In Section 3 we prove the main result of this paper for the convergence of continuous-time stochastic gradient descent. The extension of the stochastic gradient descent algorithm to the case of a variable diffusion coefficient function is described in Section 4. Section 5 provides numerical analysis of SGDCT for model estimation in several applications. Section 6 discusses SGDCT for solving continuous-time optimization problems, particularly focusing on American options.
2. Assumptions and Main Result
Before presenting the main result of this paper, Theorem 2.4, let us elaborate on the standing assumptions. Regarding the learning rate $\alpha_t$, the standing assumption is
Condition 2.1. Assume that $\int_0^\infty \alpha_t\, d t = \infty$, $\int_0^\infty \alpha_t^2\, d t < \infty$, $\int_0^\infty |\alpha_s'|\, d s < \infty$, and that there is a $p > 0$ such that $\lim_{t \rightarrow \infty} \alpha_t^2\, t^{1/2 + 2p} = 0$.
A standard choice for $\alpha_t$ that satisfies Condition 2.1 is $\alpha_t = \frac{1}{C+t}$ for some constant $0 < C < \infty$. Notice that the condition $\int_0^\infty |\alpha_s'|\, d s < \infty$ follows immediately from the other two restrictions if the learning rate is chosen to be a monotonic function of $t$.
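Indeed, each requirement of Condition 2.1 can be checked directly for this choice:

```latex
\int_{0}^{\infty} \frac{dt}{C+t} = \infty, \qquad
\int_{0}^{\infty} \frac{dt}{(C+t)^{2}} = \frac{1}{C} < \infty, \qquad
\int_{0}^{\infty} |\alpha_{t}'|\, dt = \int_{0}^{\infty} \frac{dt}{(C+t)^{2}} = \frac{1}{C} < \infty,
```

and $\alpha_t^2\, t^{1/2+2p} = t^{1/2+2p}/(C+t)^2 \rightarrow 0$ as $t \rightarrow \infty$ for any $p \in (0, 3/4)$.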
Let us next discuss the assumptions that we impose on $\sigma$, $f^*(x)$ and $f(x, \theta)$. Condition 2.2 guarantees uniqueness and existence of an invariant measure for the $X$ process.
Condition 2.2. We assume that $\sigma \sigma^\top$ is a non-degenerate, bounded diffusion matrix and that $\lim_{|x| \rightarrow \infty} f^*(x) \cdot x = -\infty$.
In addition, with respect to $\nabla_\theta f(x, \theta)$, we assume that $\theta \in \mathbb{R}^n$ and we impose the following condition.
Condition 2.3. 1. We assume that $\nabla_\theta g(x, \cdot) \in C^2(\mathbb{R}^n)$ for all $x \in \mathcal{X}$, $\frac{\partial^2 \nabla_\theta g}{\partial x^2} \in C(\mathcal{X}, \mathbb{R}^n)$, and $\nabla_\theta g(\cdot, \theta) \in C^\alpha(\mathcal{X})$ uniformly in $\theta \in \mathbb{R}^n$ for some $\alpha \in (0,1)$, and that there exist $K$ and $q$ such that
2. For every $N > 0$ there exists a constant $C(N)$ such that for all $\theta_1, \theta_2 \in \mathbb{R}^n$ and $|x| \leq N$, the diffusion coefficient $\nabla_\theta f$ satisfies
3. The function $f^*(x)$ is $C^{2+\alpha}(\mathcal{X})$ with $\alpha \in (0,1)$. Namely, it has two derivatives in $x$, with all partial derivatives being Hölder continuous, with exponent $\alpha$, with respect to $x$.
Condition 2.3 allows one to control the ergodic behavior of the $X$ process. As will be seen from the proof of the main convergence result, Theorem 2.4, one needs to control terms of the form $\int_0^t \alpha_s\left(\nabla_\theta g(X_s, \theta_s) - \nabla \bar{g}(\theta_s)\right) d s$. Due to ergodicity of the $X$ process, one expects such terms to be small in magnitude and to go to zero as $t \rightarrow \infty$. However, the speed at which they go to zero is what matters here. We treat such terms by rewriting them equivalently using appropriate Poisson type partial differential equations (PDEs). Condition 2.3 guarantees that these Poisson equations have unique solutions that do not grow faster than polynomially in the $x$ variable (see Theorem A.1 in Appendix A).
The main result of this paper is Theorem 2.4.
Theorem 2.4. Assume that Conditions 2.1, 2.2 and 2.3 hold. Then we have that
$$\lim_{t \rightarrow \infty}\left\|\nabla \bar{g}(\theta_t)\right\| = 0, \text{ almost surely.}$$
3. Proof of Theorem 2.4
We proceed in a spirit similar to that of [9]. However, apart from continuous versus discrete dynamics, one of the main challenges of the proof here is the presence of the ergodic $X$ process. Let us consider an arbitrarily given $\kappa > 0$ and $\lambda = \lambda(\kappa) > 0$ to be chosen. Then set $\sigma_0 = 0$ and consider the cycles of random times
$$\begin{aligned}
\tau_k &= \inf\left\{t > \sigma_{k-1} : \left\|\nabla \bar{g}(\theta_t)\right\| \geq \kappa\right\}, \\
\sigma_k &= \sup\left\{t > \tau_k : \frac{\left\|\nabla \bar{g}(\theta_{\tau_k})\right\|}{2} \leq \left\|\nabla \bar{g}(\theta_s)\right\| \leq 2\left\|\nabla \bar{g}(\theta_{\tau_k})\right\| \text{ for all } s \in [\tau_k, t] \text{ and } \int_{\tau_k}^t \alpha_s\, d s \leq \lambda\right\}.
\end{aligned}$$
The purpose of these random times is to control the periods of time where $\|\nabla \bar{g}(\theta_\cdot)\|$ is close to zero and away from zero. Let us next define the random time intervals $J_k = [\sigma_{k-1}, \tau_k)$ and $I_k = [\tau_k, \sigma_k)$. Notice that for every $t \in J_k$ we have $\left\|\nabla \bar{g}(\theta_t)\right\| < \kappa$.
Let us next consider some $\eta > 0$ sufficiently small, to be chosen later on, and set $\sigma_{k,\eta} = \sigma_k + \eta$. Lemma 3.1 is crucial for the proof of Theorem 2.4.
Lemma 3.1. Assume that Conditions 2.1, 2.2 and 2.3 hold. Let us set
$$\Gamma_{k,\eta} = \int_{\tau_k}^{\sigma_{k,\eta}} \alpha_s\left(\nabla_\theta g(X_s, \theta_s) - \nabla_\theta \bar{g}(\theta_s)\right) d s.$$
Then, with probability one we have that
$$\left\|\Gamma_{k,\eta}\right\| \rightarrow 0, \text{ as } k \rightarrow \infty.$$
Proof. The idea is to use Theorem A.1 in order to get an equivalent expression for the term $\Gamma_{k,\eta}$ that we seek to control.
Let us consider the function $G(x, \theta) = \nabla_\theta g(x, \theta) - \nabla_\theta \bar{g}(\theta)$. Notice that by definition and due to Condition 2.3, the function $G(x, \theta)$ satisfies the centering condition (A.1) of Theorem A.1 componentwise. So, the Poisson equation (A.2) has a unique smooth solution, denoted by $v(x, \theta)$, that grows at most polynomially in $x$. Let us apply Itô's formula to the vector-valued function $u(t, x, \theta) = \alpha_t v(x, \theta)$. Doing so, we get for $i = 1, \dots, n$
$$\begin{aligned}
u_i(\sigma, X_\sigma, \theta_\sigma) - u_i(\tau, X_\tau, \theta_\tau) &= \int_\tau^\sigma \partial_s u_i(s, X_s, \theta_s)\, d s + \int_\tau^\sigma \mathcal{L}_x u_i(s, X_s, \theta_s)\, d s + \int_\tau^\sigma \mathcal{L}_\theta u_i(s, X_s, \theta_s)\, d s \\
&\quad + \int_\tau^\sigma \alpha_s \operatorname{tr}\left[\nabla_\theta f(X_s, \theta_s)\, \nabla_x \nabla_\theta u_i(s, X_s, \theta_s)\right] d s \\
&\quad + \int_\tau^\sigma\left\langle\nabla_x u_i(s, X_s, \theta_s), \sigma\, d W_s\right\rangle + \int_\tau^\sigma \alpha_s\left\langle\nabla_\theta u_i(s, X_s, \theta_s), \nabla_\theta f(X_s, \theta_s) \sigma^{-1}\, d W_s\right\rangle,
\end{aligned}$$
where $\mathcal{L}_x$ and $\mathcal{L}_\theta$ denote the infinitesimal generators for the processes $X$ and $\theta$, respectively.
Recall now that $v(x, \theta)$ is the solution to the given Poisson equation and that $u(s, x, \theta) = \alpha_s v(x, \theta)$. Using these facts and rearranging the previous Itô formula, we get in vector notation
$$\begin{aligned}
\Gamma_{k,\eta} &= \int_{\tau_k}^{\sigma_{k,\eta}} \alpha_s\left(\nabla_\theta g(X_s, \theta_s) - \nabla_\theta \bar{g}(\theta_s)\right) d s = \int_{\tau_k}^{\sigma_{k,\eta}} \mathcal{L}_x u(s, X_s, \theta_s)\, d s \\
&= \left[\alpha_{\sigma_{k,\eta}} v(X_{\sigma_{k,\eta}}, \theta_{\sigma_{k,\eta}}) - \alpha_{\tau_k} v(X_{\tau_k}, \theta_{\tau_k}) - \int_{\tau_k}^{\sigma_{k,\eta}} \partial_s \alpha_s\, v(X_s, \theta_s)\, d s\right] \\
&\quad - \int_{\tau_k}^{\sigma_{k,\eta}} \alpha_s\left[\mathcal{L}_\theta v(X_s, \theta_s) + \alpha_s \operatorname{tr}\left[\nabla_\theta f(X_s, \theta_s)\, \nabla_{x_i} \nabla_\theta v(X_s, \theta_s)\right]_{i=1}^m\right] d s \\
&\quad - \int_{\tau_k}^{\sigma_{k,\eta}} \alpha_s\left\langle\nabla_x v(X_s, \theta_s), \sigma\, d W_s\right\rangle - \int_{\tau_k}^{\sigma_{k,\eta}} \alpha_s^2\left\langle\nabla_\theta v(X_s, \theta_s), \nabla_\theta f(X_s, \theta_s) \sigma^{-1}\, d W_s\right\rangle.
\end{aligned} \tag{3.1}$$
The next step is to treat each term on the right hand side of (3.1) separately. For this purpose, let us first set
$$J_t^{(1)} = \alpha_t \sup_{s \in [0,t]}\left\|v(X_s, \theta_s)\right\|.$$
By Theorem A.1 and Proposition 2 of [23], there are some $0 < K < \infty$ (that may change from line to line below) and $0 < q < \infty$ such that for $t$ large enough
$$\begin{aligned}
\mathbb{E}\left|J_t^{(1)}\right|^2 &\leq K \alpha_t^2\, \mathbb{E}\left[1 + \sup_{s \in [0,t]}\left\|X_s\right\|^q\right] = K \alpha_t^2\left[1 + \sqrt{t}\, \frac{\mathbb{E} \sup_{s \in [0,t]}\left\|X_s\right\|^q}{\sqrt{t}}\right] \\
&\leq K \alpha_t^2\left[1 + \sqrt{t}\right] \leq K \alpha_t^2 \sqrt{t}.
\end{aligned}$$
By Condition 2.1, let us consider $p > 0$ such that $\lim_{t \rightarrow \infty} \alpha_t^2\, t^{1/2+2p} = 0$ and, for any $\delta \in (0, p)$, define the event $A_{t,\delta} = \left\{J_t^{(1)} \geq t^{\delta - p}\right\}$. Then we have, for $t$ large enough such that $\alpha_t^2\, t^{1/2+2p} \leq 1$,

$$\mathbb{P}\left(A_{t,\delta}\right) \leq \frac{\mathbb{E}\left|J_t^{(1)}\right|^2}{t^{2(\delta - p)}} \leq K \alpha_t^2 \sqrt{t}\, t^{2(p - \delta)} = K \alpha_t^2\, t^{1/2+2p}\, t^{-2\delta} \leq K t^{-2\delta}.$$
Therefore, by the Borel-Cantelli lemma (applied along the times $t = 2^n$), we have that for every $\delta \in (0, p)$ there is a finite positive random variable $d(\omega)$ and some $n_0 < \infty$ such that for every $n \geq n_0$ one has

$$\alpha_{2^{n+1}} \sup_{s \in (0, 2^{n+1}]}\left\|v(X_s, \theta_s)\right\| \leq \frac{d(\omega)}{2^{(n+1)(p-\delta)}}.$$
Thus for $t \in [2^n, 2^{n+1})$ and $n \geq n_0$, one has for some finite constant $K < \infty$
$$J_t^{(1)} \leq K \alpha_{2^{n+1}} \sup_{s \in (0, 2^{n+1}]}\left\|v(X_s, \theta_s)\right\| \leq K \frac{d(\omega)}{2^{(n+1)(p-\delta)}} \leq K \frac{d(\omega)}{t^{p-\delta}}.$$
The latter display then guarantees that for $t \geq 2^{n_0}$ we have, with probability one,
$$J_t^{(1)} \leq K \frac{d(\omega)}{t^{p-\delta}} \rightarrow 0, \text{ as } t \rightarrow \infty. \tag{3.2}$$
Next we consider the term
$$J_{t,0}^{(2)} = \int_0^t\left\|\alpha_s' v(X_s, \theta_s) + \alpha_s\left(\mathcal{L}_\theta v(X_s, \theta_s) + \alpha_s \operatorname{tr}\left[\nabla_\theta f(X_s, \theta_s)\, \nabla_{x_i} \nabla_\theta v(X_s, \theta_s)\right]_{i=1}^m\right)\right\| d s.$$
By the bounds of Theorem A.1, we see that there are constants $0 < K < \infty$ (that may change from line to line) and $0 < q < \infty$ such that
$$\begin{aligned}
\sup_{t > 0} \mathbb{E}\left|J_{t,0}^{(2)}\right| &\leq K \int_0^\infty\left(\left|\alpha_s'\right| + \alpha_s^2\right)\left(1 + \mathbb{E}\left\|X_s\right\|^q\right) d s \\
&\leq K \int_0^\infty\left(\left|\alpha_s'\right| + \alpha_s^2\right) d s \\
&\leq K.
\end{aligned}$$
The first inequality follows by Theorem A.1, the second inequality follows by Proposition 1 in [23], and the third inequality follows by Condition 2.1.
The latter display implies that there is a finite random variable $\bar{J}_{\infty,0}^{(2)}$ such that

$$J_{t,0}^{(2)} \rightarrow \bar{J}_{\infty,0}^{(2)}, \text{ as } t \rightarrow \infty, \text{ with probability one.} \tag{3.3}$$
The last term that we need to consider is the martingale term
$$J_{t,0}^{(3)} = \int_0^t \alpha_s\left\langle\nabla_x v(X_s, \theta_s), \sigma\, d W_s\right\rangle + \int_0^t \alpha_s^2\left\langle\nabla_\theta v(X_s, \theta_s), \nabla_\theta f(X_s, \theta_s) \sigma^{-1}\, d W_s\right\rangle.$$
Notice that the Burkholder-Davis-Gundy inequality and the bounds of Theorem A.1 (doing calculations similar to the ones for the term $J_{t,0}^{(2)}$) give us that, for some finite constant $K < \infty$,
s u p_(t > 0)E|J_(t,0)^((3))|^(2) <= Kint_(0)^(oo)alpha_(s)^(2)ds < oo\sup _{t>0} \mathbb{E}\left|J_{t, 0}^{(3)}\right|^{2} \leq K \int_{0}^{\infty} \alpha_{s}^{2} d s<\infty
Thus, by Doob's martingale convergence theorem, there is a square-integrable random variable $\bar{J}_{\infty, 0}^{(3)}$ such that
$$J_{t, 0}^{(3)} \rightarrow \bar{J}_{\infty, 0}^{(3)} \text{ as } t \rightarrow \infty \text{, both almost surely and in } L^{2}.$$
Let us now go back to (3.1). Using the terms $J_{t}^{(1)}, J_{t, 0}^{(2)}$ and $J_{t, 0}^{(3)}$ we can write
The last display, together with (3.2), (3.3) and (3.4), implies the statement of the lemma.
Lemma 3.2. Assume that Conditions 2.1, 2.2 and 2.3 hold. Choose $\lambda>0$ such that for a given $\kappa>0$ one has $3 \lambda+\frac{\lambda}{4 \kappa}=\frac{1}{2 L_{\nabla \bar{g}}}$, where $L_{\nabla \bar{g}}$ is the Lipschitz constant of $\nabla \bar{g}$. For $k$ large enough and for $\eta>0$ small enough (potentially random, depending on $k$), one has $\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\, d s>\lambda$. In addition, we also have $\frac{\lambda}{2} \leq \int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\, d s \leq \lambda$ with probability one.

Then, for any $s \in \mathbb{R}$, we have $\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\| / R_{s} \leq 2$.
We proceed with an argument by contradiction. In particular, let us assume that $\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\, d s \leq \lambda$ and choose arbitrarily some $\epsilon>0$ such that $\epsilon \leq \lambda / 8$.
Let us now make some remarks that are independent of the sign of $\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\, d s-\lambda$. Due to the summability condition $\int_{0}^{\infty} \alpha_{t}^{2}\, d t<\infty$, the bound $\frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \leq 1$, and Conditions 2.1 and 2.3, we have that
$$\sup _{t>0} \mathbb{E}\left|\int_{0}^{t} \alpha_{s} \frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s}\right|^{2}<\infty.$$
Hence, the martingale convergence theorem applies to the martingale $\int_{0}^{t} \alpha_{s} \frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s}$. This means that there exists a square-integrable random variable $M$ such that $\int_{0}^{t} \alpha_{s} \frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s} \rightarrow M$ both almost surely and in $L^{2}$. It follows that for the given $\epsilon>0$ there is $k$ large enough such that $\left\|\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s} \frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s}\right\|<\epsilon$ almost surely.
Let us also assume that, for the given $k$, $\eta$ is small enough that for any $s \in\left[\tau_{k}, \sigma_{k, \eta}\right]$ one has $\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\| \leq 3\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|$.
Let us next bound appropriately the Euclidean norm of the vector-valued random variable
$$\Gamma_{k, \eta}=\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\left(\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)-\nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right) d s .$$
By Lemma 3.1, for the same $0<\epsilon \leq \lambda / 8$ chosen before, there is $k$ large enough such that, almost surely,
Hence, using also the fact that $\frac{\kappa}{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|} \leq 1$, we obtain
But then we would necessarily have $\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\, d s>\lambda$, since otherwise $\sigma_{k, \eta} \in\left[\tau_{k}, \sigma_{k}\right]$, which is impossible.
Next we move on to prove the second statement of the lemma. By definition we have $\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\, d s \leq \lambda$, so it remains to show that $\frac{\lambda}{2} \leq \int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\, d s$. Since we know that $\int_{\tau_{k}}^{\sigma_{k, \eta}} \alpha_{s}\, d s>\lambda$, and because for $k$ large enough and $\eta$ small enough one should have $\int_{\sigma_{k}}^{\sigma_{k, \eta}} \alpha_{s}\, d s \leq \lambda / 2$, we obtain that
$$\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\, d s \geq \lambda-\int_{\sigma_{k}}^{\sigma_{k, \eta}} \alpha_{s}\, d s \geq \lambda-\lambda / 2=\lambda / 2,$$
concluding the proof of the lemma.
Lemma 3.3 shows that the function $\bar{g}$ and its first two derivatives are uniformly bounded in $\theta$.
Lemma 3.3. Assume Conditions 2.1, 2.2 and 2.3. For any $q>0$, there is a constant $C<\infty$ such that
$$\int_{\mathcal{X}}\left(1+|x|^{q}\right) \pi(d x) \leq C.$$
In addition, there is a constant $C<\infty$ such that $\sum_{i=0}^{2}\left\|\nabla_{\theta}^{i} \bar{g}(\theta)\right\| \leq C$.
Proof. By Theorem 1 in [24], the density $\mu$ of the measure $\pi$ admits, for any $p$, a constant $C_{p}$ such that $\mu(x) \leq \frac{C_{p}}{1+|x|^{p}}$. Choosing $p$ large enough that $\int_{\mathcal{X}} \frac{1+|x|^{q}}{1+|x|^{p}}\, d x<\infty$, we then obtain
$$\int_{\mathcal{X}}\left(1+|x|^{q}\right) \pi(d x) \leq \int_{\mathcal{X}} C_{p} \frac{1+|x|^{q}}{1+|x|^{p}}\, d x \leq C,$$
concluding the proof of the first statement of the lemma. Let us now focus on the second part. We only prove the claim for $i=0$, since, due to the bounds in Condition 2.3, the proof for $i=1,2$ is the same. By Condition 2.3 and the first part of the lemma, there exist constants $0<q, K, C<\infty$ such that
Our next goal is to show that if the index $k$ is large enough, then $\bar{g}$ decreases, in the sense of Lemma 3.4.

Lemma 3.4. Assume Conditions 2.1, 2.2 and 2.3. Suppose that there are an infinite number of intervals $I_{k}=\left[\tau_{k}, \sigma_{k}\right)$. There is a fixed constant $\gamma=\gamma(\kappa)>0$ such that for $k$ large enough, one has $\bar{g}\left(\theta_{\sigma_{k}}\right)-\bar{g}\left(\theta_{\tau_{k}}\right) \leq-\gamma$.

Proof. By Itô's formula,
$$\begin{aligned}
\bar{g}\left(\theta_{\sigma_{k}}\right)-\bar{g}\left(\theta_{\tau_{k}}\right) & =-\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\|^{2} d s+\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\nabla \bar{g}\left(\theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s}\right\rangle \\
& \quad+\int_{\tau_{k}}^{\sigma_{k}} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta} \nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right] d s \\
& \quad+\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\nabla_{\theta} \bar{g}\left(\theta_{s}\right), \nabla_{\theta} \bar{g}\left(\theta_{s}\right)-\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)\right\rangle d s \\
& =\Theta_{1, k}+\Theta_{2, k}+\Theta_{3, k}+\Theta_{4, k} .
\end{aligned}$$
Let us first consider $\Theta_{1, k}$. Notice that for all $s \in\left[\tau_{k}, \sigma_{k}\right]$ one has $\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|}{2} \leq\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\| \leq 2\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|$. Hence, for sufficiently large $k$, we have the upper bound
$$\Theta_{1, k}=-\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\|^{2} d s \leq-\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|^{2}}{4} \int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\, d s \leq-\frac{\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\|^{2}}{8} \lambda,$$
since Lemma 3.2 proved that $\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\, d s \geq \frac{\lambda}{2}$ for sufficiently large $k$.
We next address $\Theta_{2, k}$ and show that it becomes small as $k \rightarrow \infty$. First notice that we can trivially write
$$\begin{aligned}
\sup _{t>0} \mathbb{E}\left|\int_{0}^{t} \alpha_{s}\left\langle\frac{\nabla \bar{g}\left(\theta_{s}\right)}{R_{s}}, \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s}\right\rangle\right|^{2} & \leq 4\, \mathbb{E} \int_{0}^{\infty} \alpha_{s}^{2}\left\|\nabla_{\theta} f\left(X_{s}, \theta_{s}\right)\right\|^{2} d s \\
& \leq K \int_{0}^{\infty} \alpha_{s}^{2}\left(1+\mathbb{E}\left\|X_{s}\right\|^{q}\right) d s<\infty,
\end{aligned}$$
where $R_{s}$ is defined via (3.5). Hence, by Doob's martingale convergence theorem there is a square-integrable random variable $M$ such that $\int_{0}^{t} \alpha_{s}\left\langle\frac{\nabla \bar{g}\left(\theta_{s}\right)}{R_{s}}, \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s}\right\rangle \rightarrow M$ both almost surely and in $L^{2}$. The latter statement implies that for a given $\epsilon>0$ there is $k$ large enough such that, almost surely,
Turning now to $\Theta_{3, k}$, note that
$$\begin{aligned}
\sup _{t>0}\, & \mathbb{E}\left\|\int_{0}^{t} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta} \nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right] d s\right\| \\
& \leq C \int_{0}^{\infty} \frac{\alpha_{s}^{2}}{2}\, \mathbb{E}\left(1+\left\|X_{s}\right\|^{q}\right) d s<\infty,
\end{aligned}$$
where we have used Condition 2.3 and Lemma 3.3. Bound (3.7) implies that
$$\int_{0}^{\infty} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta} \nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right] d s$$
is finite almost surely, which in turn implies that there is a finite random variable $\Theta_{3}^{\infty}$ such that
$$\int_{0}^{t} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta} \nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right] d s \rightarrow \Theta_{3}^{\infty} \text{ as } t \rightarrow \infty$$
with probability one. Since $\Theta_{3}^{\infty}$ is finite, $\int_{\tau_{k}}^{\sigma_{k}} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta} \nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right] d s \rightarrow 0$ as $k \rightarrow \infty$ with probability one.
Finally, we address $\Theta_{4, k}$. Let us consider the function $G(x, \theta)=\left\langle\nabla_{\theta} \bar{g}(\theta), \nabla_{\theta} g(x, \theta)-\nabla_{\theta} \bar{g}(\theta)\right\rangle$. The function $G(x, \theta)$ satisfies the centering condition (A.1) of Theorem A.1. Therefore, the Poisson equation (A.2) with right-hand side $G(x, \theta)$ has a unique smooth solution, say $v(x, \theta)$, that grows at most polynomially in $x$. Let us apply Itô's formula to the function $u(t, x, \theta)=\alpha_{t} v(x, \theta)$ built from this solution:
$$\begin{aligned}
u\left(\sigma, X_{\sigma}, \theta_{\sigma}\right)-u\left(\tau, X_{\tau}, \theta_{\tau}\right) & =\int_{\tau}^{\sigma} \partial_{s} u\left(s, X_{s}, \theta_{s}\right) d s+\int_{\tau}^{\sigma} \mathcal{L}_{x} u\left(s, X_{s}, \theta_{s}\right) d s+\int_{\tau}^{\sigma} \mathcal{L}_{\theta} u\left(s, X_{s}, \theta_{s}\right) d s \\
& \quad+\int_{\tau}^{\sigma} \alpha_{s} \operatorname{tr}\left[\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \nabla_{x} \nabla_{\theta} u\left(s, X_{s}, \theta_{s}\right)\right] d s \\
& \quad+\int_{\tau}^{\sigma}\left\langle\nabla_{x} u\left(s, X_{s}, \theta_{s}\right), \sigma\, d W_{s}\right\rangle+\int_{\tau}^{\sigma} \alpha_{s}\left\langle\nabla_{\theta} u\left(s, X_{s}, \theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s}\right\rangle .
\end{aligned}$$
Rearranging the previous Itô formula yields
$$\begin{aligned}
\Theta_{4, k} & =\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\nabla_{\theta} \bar{g}\left(\theta_{s}\right), \nabla_{\theta} g\left(X_{s}, \theta_{s}\right)-\nabla_{\theta} \bar{g}\left(\theta_{s}\right)\right\rangle d s=\int_{\tau_{k}}^{\sigma_{k}} \mathcal{L}_{x} u\left(s, X_{s}, \theta_{s}\right) d s \\
& =\left[\alpha_{\sigma_{k}} v\left(X_{\sigma_{k}}, \theta_{\sigma_{k}}\right)-\alpha_{\tau_{k}} v\left(X_{\tau_{k}}, \theta_{\tau_{k}}\right)-\int_{\tau_{k}}^{\sigma_{k}} \partial_{s} \alpha_{s}\, v\left(X_{s}, \theta_{s}\right) d s\right] \\
& \quad-\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left[\mathcal{L}_{\theta} v\left(X_{s}, \theta_{s}\right)+\alpha_{s} \operatorname{tr}\left[\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \nabla_{x} \nabla_{\theta} v\left(X_{s}, \theta_{s}\right)\right]\right] d s \\
& \quad-\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\nabla_{x} v\left(X_{s}, \theta_{s}\right), \sigma\, d W_{s}\right\rangle-\int_{\tau_{k}}^{\sigma_{k}} \alpha_{s}\left\langle\nabla_{\theta} v\left(X_{s}, \theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s}\right\rangle .
\end{aligned}$$
Following exactly the same steps as in the proof of Lemma 3.1 gives that $\left\|\Theta_{4, k}\right\| \rightarrow 0$ as $k \rightarrow \infty$ almost surely.
We now return to $\bar{g}\left(\theta_{\sigma_{k}}\right)-\bar{g}\left(\theta_{\tau_{k}}\right)$ and provide an upper bound which is negative. For sufficiently large $k$, we have that:
Choose $\epsilon=\min \left\{\frac{\lambda \kappa^{2}}{32}, \frac{\lambda}{32}\right\}$. On the one hand, if $\left\|\nabla \bar{g}\left(\theta_{\tau_{k}}\right)\right\| \geq 1$:
Finally, let $\gamma=\frac{\kappa^{2}}{32} \lambda$, and the proof of the lemma is complete.
Lemma 3.5. Assume Conditions 2.1, 2.2 and 2.3. Suppose that there are an infinite number of intervals $I_{k}=\left[\tau_{k}, \sigma_{k}\right)$. There is a fixed constant $\gamma_{1}<\gamma$ such that for $k$ large enough, $\bar{g}\left(\theta_{\tau_{k}}\right)-\bar{g}\left(\theta_{\sigma_{k-1}}\right) \leq \gamma_{1}$.
Proof. First, recall that $\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\| \leq \kappa$ for $t \in J_{k}=\left[\sigma_{k-1}, \tau_{k}\right]$. Similarly to before, we have that:
$$\begin{aligned}
\bar{g}\left(\theta_{\tau_{k}}\right)-\bar{g}\left(\theta_{\sigma_{k-1}}\right) & =-\int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\|\nabla \bar{g}\left(\theta_{s}\right)\right\|^{2} d s+\int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\nabla \bar{g}\left(\theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s}\right\rangle \\
& \quad+\int_{\sigma_{k-1}}^{\tau_{k}} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta}^{2} \bar{g}\left(\theta_{s}\right)\right] d s \\
& \quad+\int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\nabla_{\theta} \bar{g}\left(\theta_{s}\right), \nabla_{\theta} \bar{g}\left(\theta_{s}\right)-\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)\right\rangle d s \\
& \leq \int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\nabla \bar{g}\left(\theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s}\right\rangle \\
& \quad+\int_{\sigma_{k-1}}^{\tau_{k}} \frac{\alpha_{s}^{2}}{2} \operatorname{tr}\left[\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)\left(\nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\right)^{\top} \nabla_{\theta}^{2} \bar{g}\left(\theta_{s}\right)\right] d s \\
& \quad+\int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\nabla_{\theta} \bar{g}\left(\theta_{s}\right), \nabla_{\theta} \bar{g}\left(\theta_{s}\right)-\nabla_{\theta} g\left(X_{s}, \theta_{s}\right)\right\rangle d s .
\end{aligned}$$
The right-hand side (RHS) of equation (3.8) converges almost surely to 0 as $k \rightarrow \infty$ by arguments similar to those given in Lemma 3.4. Indeed, the treatment of the second and third terms on the RHS of (3.8) is exactly the same as in Lemma 3.4. It remains to show that the first term on the RHS of (3.8) converges almost surely to 0 as $k \rightarrow \infty$.
As shown in Lemma 3.4, $\int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\frac{\nabla \bar{g}\left(\theta_{s}\right)}{R_{s}}, \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s}\right\rangle \rightarrow 0$ as $k \rightarrow \infty$ almost surely. Finally, note that $\left\|\nabla \bar{g}\left(\theta_{\sigma_{k-1}}\right)\right\| \leq \kappa$ (except when $\sigma_{k-1}=\tau_{k}$, in which case the interval $J_{k}$ has length 0 and hence the integral (3.9) over $J_{k}$ is 0). Then, $\int_{\sigma_{k-1}}^{\tau_{k}} \alpha_{s}\left\langle\nabla \bar{g}\left(\theta_{s}\right), \nabla_{\theta} f\left(X_{s}, \theta_{s}\right) \sigma^{-1}\, d W_{s}\right\rangle \rightarrow 0$ as $k \rightarrow \infty$ almost surely.
Therefore, with probability one, $\bar{g}\left(\theta_{\tau_{k}}\right)-\bar{g}\left(\theta_{\sigma_{k-1}}\right) \leq \gamma_{1}<\gamma$ for sufficiently large $k$.
Proof of Theorem 2.4. Choose a $\kappa>0$. First, consider the case where there are a finite number of times $\tau_{k}$. Then, there is a finite $T$ such that $\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\|<\kappa$ for $t \geq T$. Now, consider the other case, where there are an infinite number of times $\tau_{k}$, and use Lemmas 3.4 and 3.5. With probability one,
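Combining Lemmas 3.4 and 3.5 over the successive intervals gives the telescoping bound (a reconstruction of the omitted display, with $\gamma_{1}<\gamma$ as in those lemmas):

$$\bar{g}\left(\theta_{\tau_{n+1}}\right)=\bar{g}\left(\theta_{\tau_{1}}\right)+\sum_{k=1}^{n}\left[\bar{g}\left(\theta_{\sigma_{k}}\right)-\bar{g}\left(\theta_{\tau_{k}}\right)\right]+\sum_{k=1}^{n}\left[\bar{g}\left(\theta_{\tau_{k+1}}\right)-\bar{g}\left(\theta_{\sigma_{k}}\right)\right] \leq \bar{g}\left(\theta_{\tau_{1}}\right)-n\left(\gamma-\gamma_{1}\right).$$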
Letting $n \rightarrow \infty$ then yields $\bar{g}\left(\theta_{\tau_{n+1}}\right) \rightarrow-\infty$. However, by definition $\bar{g}(\theta) \geq 0$. This is a contradiction, and therefore, almost surely, there are only a finite number of times $\tau_{k}$.
Consequently, there exists a finite (possibly random) time $T$ such that, almost surely, $\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\|<\kappa$ for $t \geq T$. Since $\kappa>0$ was arbitrary, this shows that $\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\| \rightarrow 0$ as $t \rightarrow \infty$ almost surely.
4. Estimating the Coefficient Function of the Diffusion Term and Generalizations
We consider a diffusion $X_{t} \in \mathcal{X}=\mathbb{R}^{m}$:
$$d X_{t}=f^{*}\left(X_{t}\right) d t+\sigma^{*}\left(X_{t}\right) d W_{t} .$$
The goal is to statistically estimate a model $f(x, \theta)$ for $f^{*}(x)$, as well as a model $\sigma(x, \nu) \sigma^{\top}(x, \nu)$ for the diffusion coefficient $\sigma^{*}(x) \sigma^{*, \top}(x)$, where $\theta \in \mathbb{R}^{n}$ and $\nu \in \mathbb{R}^{k}$. Here $W_{t} \in \mathbb{R}^{m}$ is a standard Brownian motion and $\sigma^{*}(\cdot) \in \mathbb{R}^{m \times m}$. The functions $f(x, \theta), \sigma(x, \nu), f^{*}(x)$, and $\sigma^{*}(x)$ may be non-convex.
The stochastic gradient descent update in continuous time follows the stochastic differential equations:
$$\begin{aligned}
d \theta_{t} & =\alpha_{t} \nabla_{\theta} f\left(X_{t}, \theta_{t}\right)\left[d X_{t}-f\left(X_{t}, \theta_{t}\right) d t\right], \\
d \nu_{t} & =\alpha_{t} \sum_{i, j}^{m} \nabla_{\nu}\left(\left(\sigma\left(X_{t}, \nu_{t}\right) \sigma^{\top}\left(X_{t}, \nu_{t}\right)\right)_{i, j}\right)\left[d\left\langle X_{t}, X_{t}\right\rangle_{i, j}-\left(\sigma\left(X_{t}, \nu_{t}\right) \sigma^{\top}\left(X_{t}, \nu_{t}\right)\right)_{i, j} d t\right],
\end{aligned}$$
where $\left\langle X_{t}, X_{t}\right\rangle \in \mathbb{R}^{m \times m}$ is the quadratic variation matrix of $X$. Since we observe the path of $X_{t}$, we also observe the path of the quadratic variation $\left\langle X_{t}, X_{t}\right\rangle$.
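As a purely illustrative sketch (our own, not from the paper), consider the scalar case $dX_t=-\theta^* X_t\,dt+\sigma^*\,dW_t$ with drift model $f(x,\theta)=-\theta x$ and constant-volatility model $\sigma(\nu)=\nu$. An Euler discretization replaces $dX_t$ by $\Delta X$ and the quadratic-variation increment $d\langle X, X\rangle_t$ by $(\Delta X)^2$; the function name and learning-rate constants below are our own choices for a quick demo:

```python
import numpy as np

def sgdct_drift_and_vol(theta_star=1.0, sigma_star=1.0, T=500.0, dt=0.01, seed=0):
    """SGDCT for dX = -theta* X dt + sigma* dW, with drift model f(x, theta) = -theta x
    and volatility model sigma(nu) = nu. Returns the estimates (theta_T, nu_T)."""
    rng = np.random.default_rng(seed)
    x, theta, nu = 0.0, 0.2, 0.4                      # arbitrary initial guesses
    for i in range(int(T / dt)):
        alpha = 4.0 / (5.0 + i * dt)                  # decaying learning rate
        dW = np.sqrt(dt) * rng.standard_normal()
        dX = -theta_star * x * dt + sigma_star * dW   # observed increment
        # drift update: dtheta = alpha * grad_theta(f) * (dX - f dt), grad_theta(f) = -x
        theta += alpha * (-x) * (dX - (-theta * x) * dt)
        # diffusion update: dnu = alpha * grad_nu(nu^2) * (d<X> - nu^2 dt), d<X> ~ (dX)^2
        nu += alpha * 2.0 * nu * (dX**2 - nu**2 * dt)
        x += dX
    return theta, nu

theta_T, nu_T = sgdct_drift_and_vol()
```

The learning rate $\alpha_t = 4/(5+t)$ satisfies $\int_0^\infty \alpha_t\,dt=\infty$ and $\int_0^\infty \alpha_t^2\,dt<\infty$, in the spirit of the conditions used in the paper, and both estimates settle near the true values $\theta^*=1$ and $\sigma^*=1$.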
We assume that $\sigma^{*}(x)$ is such that the process $X_{t}$ is ergodic with a unique invariant measure (for example, one may assume that it is non-degenerate, i.e., bounded away from zero and bounded above). In addition, we assume that $w(x, \nu)$ satisfies the same assumptions as $g(x, \theta)$ does in Condition 2.3.
By the previous results in Section 3, $\lim _{t \rightarrow \infty}\left\|\nabla \bar{g}\left(\theta_{t}\right)\right\|=0$ with probability one. Let us now study the convergence of the stochastic gradient descent algorithm (4.2) for $\nu_{t}$. By Itô's formula,
$$\bar{w}\left(\nu_{\sigma}\right)-\bar{w}\left(\nu_{\tau}\right)=-\int_{\tau}^{\sigma} \alpha_{s}\left\|\nabla \bar{w}\left(\nu_{s}\right)\right\|^{2} d s+\int_{\tau}^{\sigma} \alpha_{s}\left\langle\nabla_{\nu} \bar{w}\left(\nu_{s}\right), \nabla_{\nu} \bar{w}\left(\nu_{s}\right)-\nabla_{\nu} w\left(X_{s}, \nu_{s}\right)\right\rangle d s.$$
Applying exactly the same procedure as in Section 3, $\left\|\nabla \bar{w}\left(\nu_{t}\right)\right\| \rightarrow 0$ as $t \rightarrow \infty$ with probability one. We omit the details, as the proof is exactly the same as in Section 3.
Notice also that $\sigma^{*}(x)$ is not identifiable; for example, $X_{t}$ has the same distribution under the diffusion coefficient $-\sigma^{*}(x)$. Only $\sigma^{*}(x) \sigma^{*, \top}(x)$ is identifiable. We are therefore essentially estimating a model $\sigma(x, \nu) \sigma^{\top}(x, \nu)$ for $\sigma^{*}(x) \sigma^{*, \top}(x)$.
We close this section with the following remark.
Remark 4.1. The proof of Theorem 2.4 makes it clear that if appropriate assumptions on $\nabla_{\theta} f$ and $\nabla_{\theta} g$ are made such that $\sup _{t>0} \mathbb{E}\left\|\theta_{t}\right\|^{q}<C$ for appropriate $0<q, C<\infty$, then one can relax Condition 2.3 on $\nabla_{\theta} g$ to allow at least linear growth with respect to $\theta$.
5. Model Estimation: Numerical Analysis
We implement SGDCT for several applications and numerically analyze the convergence. Section 5.1 studies continuous-time stochastic gradient descent for the Ornstein-Uhlenbeck process, which is widely used in finance, physics, and biology. Section 5.2 studies the multidimensional Ornstein-Uhlenbeck process. Section 5.3 estimates the diffusion coefficient in Burgers' equation with continuous-time stochastic gradient descent. Burgers' equation is a widely used nonlinear partial differential equation which is important in fluid mechanics, acoustics, and aerodynamics, and is used extensively in engineering. In Section 5.4, we show how SGDCT can be used for reinforcement learning. In the final example, the drift and volatility functions of the multidimensional CIR process are estimated. The CIR process is widely used in financial modeling.
5.1. Ornstein-Uhlenbeck process
The Ornstein-Uhlenbeck (OU) process $X_{t} \in \mathbb{R}$ satisfies the stochastic differential equation:
$$d X_{t}=c\left(m-X_{t}\right) d t+d W_{t} .$$
We use continuous-time stochastic gradient descent to learn the parameters $\theta=(c, m) \in \mathbb{R}^{2}$.
For the numerical experiments, we use an Euler scheme with a time step of $10^{-2}$. The learning rate is $\alpha_{t}=\min (\alpha, \alpha / t)$ with $\alpha=10^{-2}$. We simulate data from (5.1) for a particular $\theta^{*}$, and stochastic gradient descent attempts to learn a parameter $\theta_{t}$ which fits the data well. $\theta_{t}$ is the statistical estimate of $\theta^{*}$ at time $t$; if the estimation is accurate, $\theta_{t}$ should of course be close to $\theta^{*}$. This example can be placed in the form of the original class of equations (1.1) by setting $f(x, \theta)=c(m-x)$ and $f^{*}(x)=f\left(x, \theta^{*}\right)$.
We study 10,500 cases. For each case, a different theta^(**)\theta^{*} is generated uniformly at random in the range [1,2]xx[1,2][1,2] \times[1,2]. For each case, we solve for the parameter theta_(t)\theta_{t} over the time period [0,T][0, T] for T=10^(6)T=10^{6}. To summarize, for cases n=1\mathrm{n}=1 to 10,500:
- Generate a random theta^(**)\theta^{*} in [1,2]xx[1,2][1,2] \times[1,2].
- Simulate a single path of X_(t)X_{t} given theta^(**)\theta^{*} and simultaneously solve for the path of theta_(t)\theta_{t} on [0,T][0, T].
The accuracy of theta_(t)\theta_{t} at times t=10^(2),10^(3),10^(4),10^(5)t=10^{2}, 10^{3}, 10^{4}, 10^{5}, and 10^(6)10^{6} is reported in Table 1. Figures 1 and 2 plot the mean error in percent and mean squared error (MSE) against time. In the table and figures, the "error" is |theta_(t)^(n)-theta^(**,n)|\left|\theta_{t}^{n}-\theta^{*, n}\right| where nn represents the nn-th case. The "error in percent" is 100 xx(|theta_(t)^(n)-theta^(**,n)|)/(|theta^(**,n)|)100 \times \frac{\left|\theta_{t}^{n}-\theta^{*, n}\right|}{\left|\theta^{*, n}\right|}. The "mean error in percent" is the average of these errors, i.e. (100 )/(N)sum_(n=1)^(N)(|theta_(t)^(n)-theta^(**,n)|)/(|theta^(**,n)|)\frac{100}{N} \sum_{n=1}^{N} \frac{\left|\theta_{t}^{n}-\theta^{*, n}\right|}{\left|\theta^{*, n}\right|}.
Figure 1: Mean error in percent plotted against time. Time is in log scale.
Figure 2: Mean squared error plotted against time. Time is in log scale.
Table 1: Error at different times for the estimate theta_(t)\theta_{t} of theta^(**)\theta^{*} across 10,500 cases. The "error" is |theta_(t)^(n)-theta^(**,n)|\left|\theta_{t}^{n}-\theta^{*, n}\right| where nn represents the nn-th case. The "error in percent" is 100 xx(|theta_(t)^(n)-theta^(**,n)|)/(|theta^(**,n)|)100 \times \frac{\left|\theta_{t}^{n}-\theta^{*, n}\right|}{\left|\theta^{*, n}\right|}.
Finally, we also track the objective function bar(g)(theta_(t))\bar{g}\left(\theta_{t}\right) over time. Figure 3 plots the error bar(g)(theta_(t))\bar{g}\left(\theta_{t}\right) against time. Since the limiting distribution pi(x)\pi(x) of (5.1) is Gaussian with mean m^(**)m^{*} and variance (1)/(2c^(**))\frac{1}{2 c^{*}}, we have that:
$$
\begin{aligned}
\bar{g}(\theta) & =\int\left(c^{*}\left(m^{*}-x\right)-c(m-x)\right)^{2} \pi(x) d x \\
& =\left(c^{*} m^{*}-c m\right)^{2}+\left(c^{*}-c\right)^{2}\left(\frac{1}{2 c^{*}}+\left(m^{*}\right)^{2}\right)+2\left(c^{*} m^{*}-c m\right)\left(c-c^{*}\right) m^{*}
\end{aligned}
$$
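This closed form is easy to sanity-check numerically: the sketch below integrates the squared drift mismatch against pi with Gauss-Hermite quadrature and compares it to the formula above (the parameter values are arbitrary test inputs).

```python
import numpy as np

# Check the closed-form g-bar(theta) against quadrature with respect to
# pi = N(m*, 1/(2c*)). The integrand is a degree-2 polynomial in x, so
# 40-node Gauss-Hermite quadrature is exact up to rounding.
c_star, m_star = 1.7, 1.3
c, m = 1.2, 1.6
var = 1.0 / (2.0 * c_star)

nodes, weights = np.polynomial.hermite_e.hermegauss(40)   # standard-normal nodes
x = m_star + np.sqrt(var) * nodes
vals = (c_star * (m_star - x) - c * (m - x)) ** 2
g_quad = np.dot(weights, vals) / np.sqrt(2 * np.pi)       # E[(f* - f)^2]

g_closed = ((c_star * m_star - c * m) ** 2
            + (c_star - c) ** 2 * (var + m_star ** 2)
            + 2 * (c_star * m_star - c * m) * (c - c_star) * m_star)

print(g_quad, g_closed)    # the two values agree to machine precision
```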
Figure 3: The error bar(g)(theta_(t))\bar{g}\left(\theta_{t}\right) plotted against time. The mean error and the quantiles of the error are calculated from the 10,500 cases. Time is in log scale.
6.2. Multidimensional Ornstein-Uhlenbeck process
The multidimensional Ornstein-Uhlenbeck process X_(t)inR^(d)X_{t} \in \mathbb{R}^{d} satisfies the stochastic differential equation:
dX_(t)=(M-AX_(t))dt+dW_(t).d X_{t}=\left(M-A X_{t}\right) d t+d W_{t} .
We use continuous-time stochastic gradient descent to learn the parameters theta=(M,A)inR^(d)xxR^(d xx d)\theta=(M, A) \in \mathbb{R}^{d} \times \mathbb{R}^{d \times d}. For the numerical experiments, we use an Euler scheme with a time step of 10^(-2)10^{-2}. The learning rate is alpha_(t)=min(alpha,alpha//t)\alpha_{t}=\min (\alpha, \alpha / t) with alpha=10^(-1)\alpha=10^{-1}. We simulate data from (5.2) for a particular theta^(**)=(M^(**),A^(**))\theta^{*}=\left(M^{*}, A^{*}\right) and the stochastic gradient descent attempts to learn a parameter theta_(t)\theta_{t} which fits the data well. theta_(t)\theta_{t} is the statistical estimate for theta^(**)\theta^{*} at time tt. If the estimation is accurate, theta_(t)\theta_{t} should of course be close to theta^(**)\theta^{*}. This example can be placed in the form of the original class of equations (1.1) by setting f(x,theta)=M-Axf(x, \theta)=M-A x and f^(**)(x)=f(x,theta^(**))f^{*}(x)=f\left(x, \theta^{*}\right).
The matrix A^(**)A^{*} must be generated carefully to ensure that X_(t)X_{t} is ergodic and has a stable equilibrium point. If some of A^(**)A^{*} 's eigenvalues have negative real parts, then X_(t)X_{t} can become unstable and grow arbitrarily large. Therefore, we randomly generate matrices A^(**)A^{*} which are strictly diagonally dominant. A^(**)A^{*} 's eigenvalues are therefore guaranteed to have positive real parts and X_(t)X_{t} will be ergodic. To generate random strictly diagonally dominant matrices A^(**)A^{*}, we first generate A_(i,j)^(**)A_{i, j}^{*} uniformly at random in the range [1,2] for i!=ji \neq j. Then, we set A_(i,i)^(**)=sum_(j!=i)A_(i,j)^(**)+U_(i,i)A_{i, i}^{*}=\sum_{j \neq i} A_{i, j}^{*}+U_{i, i} where U_(i,i)U_{i, i} is generated randomly in [1,2].M_(i)^(**)[1,2] . M_{i}^{*} for i=1,dots,di=1, \ldots, d is also generated randomly in [1,2][1,2].
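The construction above can be sketched directly; by Gershgorin's circle theorem, each eigenvalue lies in a disk centered at the (positive) diagonal entry with radius strictly smaller than that entry, so all real parts are positive. The dimension d = 10 is an arbitrary choice for the demo.

```python
import numpy as np

# Generate a strictly diagonally dominant A* as described in the text and
# verify that its eigenvalues have positive real parts.
rng = np.random.default_rng(1)
d = 10
A = rng.uniform(1.0, 2.0, size=(d, d))          # off-diagonal entries in [1, 2]
off_diag_sums = A.sum(axis=1) - np.diag(A)
# A_ii = sum_{j != i} A_ij + U_ii with U_ii uniform in [1, 2]
np.fill_diagonal(A, off_diag_sums + rng.uniform(1.0, 2.0, size=d))

eigs = np.linalg.eigvals(A)
print(eigs.real.min())                           # strictly positive
```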
We study 525 cases and analyze the error in Table 2. Figures 4 and 5 plot the error over time.
Table 2: Error at different times for the estimate theta_(t)\theta_{t} of theta^(**)\theta^{*} across 525 cases. The "error" is |theta_(t)^(n)-theta^(**,n)|\left|\theta_{t}^{n}-\theta^{*, n}\right| where nn represents the nn-th case. The "error in percent" is 100 xx(|theta_(t)^(n)-theta^(**,n)|)/(|theta^(**,n)|)100 \times \frac{\left|\theta_{t}^{n}-\theta^{*, n}\right|}{\left|\theta^{*, n}\right|}.
Figure 4: Mean error in percent plotted against time. Time is in log scale.
Figure 5: Mean squared error plotted against time. Time is in log scale.
6.3. Burger's Equation
The stochastic Burger's equation that we consider is given by:
(del u)/(del t)(t,x)=theta(del^(2)u)/(delx^(2))-u(t,x)(del u)/(del x)(t,x)+sigma(del^(2)W(t,x))/(del t del x),\frac{\partial u}{\partial t}(t, x)=\theta \frac{\partial^{2} u}{\partial x^{2}}-u(t, x) \frac{\partial u}{\partial x}(t, x)+\sigma \frac{\partial^{2} W(t, x)}{\partial t \partial x},
where x in[0,1]x \in[0,1] and W(t,x)W(t, x) is a Brownian sheet. The finite-difference discretization of (5.3) satisfies a system of nonlinear stochastic differential equations (for instance, see [13] or [3]). We use continuous-time stochastic gradient descent to learn the diffusion parameter theta\theta.
We use the following finite difference scheme for Burger's equation:
du(t,x_(i))=theta(u(t,x_(i+1))-2u(t,x_(i))+u(t,x_(i-1)))/(Deltax^(2))dt-u(t,x_(i))(u(t,x_(i+1))-u(t,x_(i-1)))/(2Delta x)dt+(sigma)/(sqrt(Delta x))dW_(t)^(i),d u\left(t, x_{i}\right)=\theta \frac{u\left(t, x_{i+1}\right)-2 u\left(t, x_{i}\right)+u\left(t, x_{i-1}\right)}{\Delta x^{2}} d t-u\left(t, x_{i}\right) \frac{u\left(t, x_{i+1}\right)-u\left(t, x_{i-1}\right)}{2 \Delta x} d t+\frac{\sigma}{\sqrt{\Delta x}} d W_{t}^{i},
For our numerical experiment, the boundary conditions u(t,x=0)=0u(t, x=0)=0 and u(t,x=1)=1u(t, x=1)=1 are used and sigma=0.1\sigma=0.1. Equation (5.4) is simulated with an Euler scheme (i.e., we solve Burger's equation with explicit finite difference). A spatial discretization of Delta x=.01\Delta x=.01 and a time step of 10^(-5)10^{-5} are used. The learning rate is alpha_(t)=min(alpha,alpha//t)\alpha_{t}=\min (\alpha, \alpha / t) with alpha=10^(-3)\alpha=10^{-3}. The small time step is needed to avoid instability in the explicit finite difference scheme. We simulate data from (5.3) for a particular diffusion coefficient theta^(**)\theta^{*} and the stochastic gradient descent attempts to learn a diffusion parameter theta_(t)\theta_{t} which fits the data well. theta_(t)\theta_{t} is the statistical estimate for theta^(**)\theta^{*} at time tt. If the estimation is accurate, theta_(t)\theta_{t} should of course be close to theta^(**)\theta^{*}.
This example can be placed in the form of the original class of equations (1.1). Let f_(i)f_{i} be the ii-th element of the function ff. Then, f_(i)(u,theta)=theta(u(t,x_(i+1))-2u(t,x_(i))+u(t,x_(i-1)))/(Deltax^(2))-u(t,x_(i))(u(t,x_(i+1))-u(t,x_(i-1)))/(2Delta x)f_{i}(u, \theta)=\theta \frac{u\left(t, x_{i+1}\right)-2 u\left(t, x_{i}\right)+u\left(t, x_{i-1}\right)}{\Delta x^{2}}-u\left(t, x_{i}\right) \frac{u\left(t, x_{i+1}\right)-u\left(t, x_{i-1}\right)}{2 \Delta x}. Similarly, let f_(i)^(**)f_{i}^{*} be the ii-th element of the function f^(**)f^{*}. Then, f_(i)^(**)(u)=f_(i)(u,theta^(**))f_{i}^{*}(u)=f_{i}\left(u, \theta^{*}\right).
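A minimal sketch of this setup follows, assuming an arbitrary initial guess theta_0 and a shortened horizon (t up to 0.1 rather than T = 100) so that it runs quickly.

```python
import numpy as np

# SGDCT sketch for the diffusion coefficient in the discretized Burger's
# equation (5.4). theta_star, theta_0, and the short horizon are demo choices.
rng = np.random.default_rng(2)
dx, dt, sigma = 0.01, 1e-5, 0.1
x = np.arange(0.0, 1.0 + dx / 2, dx)
u = x.copy()                       # initial condition with u(0)=0, u(1)=1
theta_star, theta = 1.5, 3.0
alpha = 1e-3

for n in range(10_000):            # simulate up to t = 0.1
    lap = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2       # this is also grad_theta f_i
    adv = u[1:-1] * (u[2:] - u[:-2]) / (2 * dx)
    dW = np.sqrt(dt) * rng.standard_normal(lap.shape)
    du = theta_star * lap * dt - adv * dt + sigma / np.sqrt(dx) * dW
    # SGDCT update with model drift f_i(u, theta) = theta*lap_i - adv_i
    resid = du - (theta * lap * dt - adv * dt)
    t = (n + 1) * dt
    theta += min(alpha, alpha / t) * np.dot(lap, resid)
    u[1:-1] += du                  # boundary values stay fixed

print(theta)                       # moves toward theta* = 1.5
```

The time step keeps the explicit scheme stable (theta * dt / dx^2 < 0.5 throughout the run), mirroring the stability remark in the text.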
We study 525 cases. For each case, a different theta^(**)\theta^{*} is generated uniformly at random in the range [.1,10]. This represents a wide range of physical cases of interest, with theta^(**)\theta^{*} ranging over two orders of magnitude. For each case, we solve for the parameter theta_(t)\theta_{t} over the time period [0,T][0, T] for T=100T=100.
The accuracy of theta_(t)\theta_{t} at times t=10^(-1),10^(0),10^(1)t=10^{-1}, 10^{0}, 10^{1}, and 10^(2)10^{2} is reported in Table 3. Figures 6 and 7 plot the mean error in percent and mean squared error against time. The convergence of theta_(t)\theta_{t} to theta^(**)\theta^{*} is fairly rapid in time.
| Error/Time | $10^{-1}$ | $10^{0}$ | $10^{1}$ | $10^{2}$ |
| :---: | :---: | :---: | :---: | :---: |
| Maximum error | .1047 | .106 | .033 | .0107 |
| 99% quantile of error | .08 | .078 | .0255 | .00835 |
| Mean squared error | $1.00 \times 10^{-3}$ | $9.25 \times 10^{-4}$ | $1.02 \times 10^{-4}$ | $1.12 \times 10^{-5}$ |
| Mean error in percent | 1.26 | 1.17 | 0.4 | 0.13 |
| Maximum error in percent | 37.1 | 37.5 | 9.82 | 4.73 |
| 99% quantile of error in percent | 12.6 | 18.0 | 5.64 | 1.38 |
Table 3: Error at different times for the estimate theta_(t)\theta_{t} of theta^(**)\theta^{*} across 525 cases. The "error" is |theta_(t)^(n)-theta^(**,n)|\left|\theta_{t}^{n}-\theta^{*, n}\right| where nn represents the nn-th case. The "error in percent" is 100 xx|theta_(t)^(n)-theta^(**,n)|//|theta^(**,n)|100 \times\left|\theta_{t}^{n}-\theta^{*, n}\right| /\left|\theta^{*, n}\right|.
Figure 6: Mean error in percent plotted against time. Time is in log scale.
6.4. Reinforcement Learning
We consider the classic reinforcement learning problem of balancing a pole on a moving cart (see [6]). The goal is to balance the pole and to keep the cart from moving outside the boundaries by applying a force of \pm 10 Newtons.
The position xx of the cart, the velocity x^(˙)\dot{x} of the cart, the angle of the pole beta\beta, and the angular velocity beta^(˙)\dot{\beta} of the pole are observed. The dynamics of s=(x,x^(˙),beta,beta^(˙))s=(x, \dot{x}, \beta, \dot{\beta}) satisfy a set of ODEs, referred to as (5.5) below (see [6]), in which gg is the acceleration due to gravity, m_(c)m_{c} is the mass of the cart, mm is the mass of the pole, 2l2 l is the length of the pole, mu_(c)\mu_{c} is the coefficient of friction of the cart on the ground, mu_(p)\mu_{p} is the coefficient of friction of the pole on the cart, and F_(t)in{-10,10}F_{t} \in\{-10,10\} is the force applied to the cart.
For this example, f^(**)(s)=(x^(˙),x^(¨),beta^(˙),beta^(¨))f^{*}(s)=(\dot{x}, \ddot{x}, \dot{\beta}, \ddot{\beta}). The model is f(s,theta)=(f_(1)(s,theta),f_(2)(s,theta),f_(3)(s,theta),f_(4)(s,theta))f(s, \theta)=\left(f_{1}(s, \theta), f_{2}(s, \theta), f_{3}(s, \theta), f_{4}(s, \theta)\right), where each f_(i)(s,theta)f_{i}(s, \theta) is a single-layer neural network with rectified linear units:

$$f_{i}(s, \theta)=W^{2, i} h\left(W^{1, i} s+b^{1, i}\right)+b^{2, i},$$

where theta={W^(2,i),W^(1,i),b^(1,i),b^(2,i)}_(i=1)^(4)\theta=\left\{W^{2, i}, W^{1, i}, b^{1, i}, b^{2, i}\right\}_{i=1}^{4} and h(z)=(sigma(z_(1)),dots,sigma(z_(d)))h(z)=\left(\sigma\left(z_{1}\right), \ldots, \sigma\left(z_{d}\right)\right) for z inR^(d)z \in \mathbb{R}^{d}. The function sigma:RrarrR\sigma: \mathbb{R} \rightarrow \mathbb{R} is a rectified linear unit (ReLU): sigma(v)=max(v,0)\sigma(v)=\max (v, 0). We learn the parameter theta\theta using continuous-time stochastic gradient descent.

Figure 7: Mean squared error plotted against time. Time is in log scale.
The boundary is x=+-2.4x= \pm 2.4 meters and the pole must not be allowed to fall below beta=(24 pi)/(360)\beta=\frac{24 \pi}{360} radians (the frame of reference is chosen such that perfectly upright is 0 radians). A reward of +1 is received every 0.02 seconds if |x| <= 2.4|x| \leq 2.4 and |beta| <= (24 pi)/(360)|\beta| \leq \frac{24 \pi}{360}. A reward of -100 is received (and the episode ends) if the cart moves beyond x=+-2.4x= \pm 2.4 or the pole falls below beta=(24 pi)/(360)\beta=\frac{24 \pi}{360} radians. The sum of these rewards across the entire episode is the reward for that episode. The initial state (x,x^(˙),beta,beta^(˙))(x, \dot{x}, \beta, \dot{\beta}) at the start of an episode is generated uniformly at random in [-.05,.05]^(4)[-.05, .05]^{4}. For our numerical experiment, we assume that the rule for receiving the rewards and the distribution of the initial state are both known. An action of \pm 10 Newtons may be chosen every 0.02 seconds. This force is then applied for the duration of the next 0.02 seconds. The system (5.5) is simulated using an Euler scheme with a time step size of 10^(-3)10^{-3} seconds.
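The simulation can be sketched with a simple Euler step. The code below uses the standard cart-pole equations from [6] in their frictionless form (mu_c = mu_p = 0, a simplification for this demo) with the customary benchmark parameters, which are assumptions for the illustration; with no control force, a slightly tilted pole falls past the 12-degree threshold within a second.

```python
import math

# Euler simulation of the frictionless cart-pole ODEs (Barto et al. [6]).
g, m_c, m, l = 9.8, 1.0, 0.1, 0.5    # gravity, cart mass, pole mass, half-length
dt = 1e-3

def step(state, F):
    x, x_dot, beta, beta_dot = state
    cos_b, sin_b = math.cos(beta), math.sin(beta)
    total = m_c + m
    temp = (F + m * l * beta_dot**2 * sin_b) / total
    beta_acc = (g * sin_b - cos_b * temp) / (l * (4.0 / 3.0 - m * cos_b**2 / total))
    x_acc = temp - m * l * beta_acc * cos_b / total
    return (x + dt * x_dot, x_dot + dt * x_acc,
            beta + dt * beta_dot, beta_dot + dt * beta_acc)

s = (0.0, 0.0, 0.02, 0.0)            # slightly tilted pole, no control force
for _ in range(1000):                # one second of simulated time
    s = step(s, 0.0)
print(s[2])                          # angle grows past 24*pi/360 ~ 0.21 radians
```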
The goal, of course, is to statistically learn the optimal actions in order to achieve the highest possible reward. This requires both: 1) statistically learning the physical dynamics of (x,x^(˙),beta,beta^(˙))(x, \dot{x}, \beta, \dot{\beta}) and 2) finding the optimal actions given these dynamics in order to achieve the highest possible reward. The dynamics (x,x^(˙),beta,beta^(˙))(x, \dot{x}, \beta, \dot{\beta}) satisfy the set of ODEs (5.5); these dynamics can be learned using continuous-time stochastic gradient descent. We use a neural network for ff. Given the estimated dynamics ff, we use a policy gradient method to estimate the optimal actions. The approach is summarized below.
For episodes 0,1,2,dots0,1,2, \ldots :
- For time [0,T_("end of episode ")]\left[0, T_{\text {end of episode }}\right] :
  - Update the model f(s,theta)f(s, \theta) for the dynamics using continuous-time stochastic gradient descent.
  - Periodically update the optimal policy mu(s,a,theta^(mu))\mu\left(s, a, \theta^{\mu}\right) using the policy gradient method. The optimal policy is learned using data simulated from the model f(s,theta)f(s, \theta). Actions are randomly selected via the policy mu\mu.
The policy mu\mu is a neural network with parameters theta^(mu)\theta^{\mu}. We use a single hidden layer with rectified linear units followed by a softmax layer for mu(s,a,theta^(mu))\mu\left(s, a, \theta^{\mu}\right) and train it using policy gradients.${ }^{1}$ The policy mu(s,a,theta^(mu))\mu\left(s, a, \theta^{\mu}\right) gives the probability of taking action aa conditional on being in the state ss; for two actions, the softmax output reduces to the logistic function sigma_(0)(v)=(e^(v))/(1+e^(v))\sigma_{0}(v)=\frac{e^{v}}{1+e^{v}}. Of course, P[F_(t)=-10∣s_(t)=s]=mu(s,-10,theta^(mu))=1-mu(s,10,theta^(mu))\mathbb{P}\left[F_{t}=-10 \mid s_{t}=s\right]=\mu\left(s,-10, \theta^{\mu}\right)=1-\mu\left(s, 10, \theta^{\mu}\right).
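A minimal sketch of such a two-action policy follows; the layer width and the random weights are illustrative assumptions, not the paper's architecture details.

```python
import numpy as np

# Two-action policy: a single hidden ReLU layer whose scalar output passes
# through the logistic function sigma_0 (the two-action case of a softmax).
rng = np.random.default_rng(3)
W1, b1 = rng.standard_normal((16, 4)), rng.standard_normal(16)
w2, b2 = rng.standard_normal(16), 0.0

def mu_plus10(s):
    """Probability of applying +10 N in state s = (x, x_dot, beta, beta_dot)."""
    h = np.maximum(W1 @ s + b1, 0.0)        # hidden ReLU layer
    v = w2 @ h + b2
    return 1.0 / (1.0 + np.exp(-v))         # sigma_0(v)

s = np.array([0.0, 0.1, 0.02, -0.1])
p = mu_plus10(s)
print(p, 1.0 - p)    # mu(s, +10) and mu(s, -10) sum to one
```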
525 cases are run, each for 25 hours. The optimal policy is learned using the estimated dynamics f(s,theta)f(s, \theta) and is updated every 5 episodes. Table 4 reports the results at fixed episodes using continuous-time stochastic gradient descent. Table 5 reports statistics on the number of episodes required until a target episodic reward of 100, 500, or 1000 is first achieved.
Table 5: For each case, we record the number of episodes required until the target reward is first achieved using continuous-time stochastic gradient descent. The table reports statistics (maximum, quantiles, mean, minimum) of this number across the cases.
Alternatively, one could directly apply policy gradient to learn the optimal action using the observed data. This approach does not use continuous-time stochastic gradient descent to learn the model dynamics, but instead directly learns the optimal policy from the data. Again using 525 cases, we report the results in Table 6 for directly learning the optimal policy without using continuous-time stochastic gradient descent to learn the model dynamics. Comparing Tables 4 and 6, it is clear that using continuous-time stochastic gradient descent to learn the model dynamics allows for the optimal policy to be learned significantly more quickly. The rewards are much higher when using continuous-time stochastic gradient descent (see Table 4) than when not using it (see Table 6).
Table 6: Reward at the kk-th episode across the 525 cases using policy gradient to learn the optimal policy.
6.5. Estimating both the drift and volatility functions for the multidimensional CIR process
We now implement an example where SGDCT is used to estimate both the drift function and the volatility function. The multidimensional CIR process X_(t)inR^(d)X_{t} \in \mathbb{R}^{d} is:
dX_(t)=c(m-X_(t))dt+sqrt(X_(t))o.sigma dW_(t),d X_{t}=c\left(m-X_{t}\right) d t+\sqrt{X_{t}} \odot \sigma d W_{t},
where o.\odot is element-wise multiplication, m inR^(d),c,sigma inR^(d xx d),W_(t)inR^(d)m \in \mathbb{R}^{d}, c, \sigma \in \mathbb{R}^{d \times d}, W_{t} \in \mathbb{R}^{d}, with cc being a positive definite matrix. The CIR process is often used for modeling interest rates.
In equation (5.8), f(x,theta)=c(m-x)f(x, \theta)=c(m-x) where theta=(c,m)\theta=(c, m), and f^(**)(x)=f(x,theta^(**))f^{*}(x)=f\left(x, \theta^{*}\right) where theta^(**)=(c^(**),m^(**))\theta^{*}=\left(c^{*}, m^{*}\right). The volatility model is sigma(x,nu)=sqrtxo.nu\sigma(x, \nu)=\sqrt{x} \odot \nu and sigma^(**)(x)=sigma(x,nu^(**))\sigma^{*}(x)=\sigma\left(x, \nu^{*}\right) where nu,nu^(**)inR^(d xx d)\nu, \nu^{*} \in \mathbb{R}^{d \times d}. Table 7 reports the accuracy of SGDCT for estimating the drift and volatility functions of the CIR process.
| Error/Parameter | $c$ | $m$ | $\left(\sqrt{X_{t}} \odot \sigma\right)^{\top}\left(\sqrt{X_{t}} \odot \sigma\right)$ |
| :---: | :---: | :---: | :---: |
| Maximum error | 0.0157 | 0.009 | 0.010 |
| 99% quantile of error | 0.010 | 0.007 | 0.008 |
| 99.9% quantile of error | 0.0146 | 0.009 | 0.010 |
| Mean squared error | $1.49 \times 10^{-5}$ | $6.65 \times 10^{-6}$ | $4.21 \times 10^{-6}$ |
| Mean error in percent | 0.21 | 0.137 | 0.0623 |
| Maximum error in percent | 1.12 | 0.695 | 0.456 |
| 99% quantile of error in percent | 0.782 | 0.506 | 0.415 |
| 99.9% quantile of error in percent | 1.06 | 0.616 | 0.455 |
Table 7: Accuracy is reported in percent and averaged across 317 simulations. Each simulation has a different random initialization for c,mc, m, and sigma\sigma. The dimension d=3d=3, the time step size is 10^(-2)10^{-2}, and accuracy is evaluated at the final time 5xx10^(5)5 \times 10^{5}. X_(t)X_{t} is simulated using the scheme of [4]. Observations of the quadratic variation are generated from (sqrt(X_(t))o.sigma)^(TT)(sqrt(X_(t))o.sigma)\left(\sqrt{X_{t}} \odot \sigma\right)^{\top}\left(\sqrt{X_{t}} \odot \sigma\right) at times t=0,.01,.02,dotst=0, .01, .02, \ldots; this quantity is the quadratic variation per unit of time. For each simulation, the average error (or average percent error) for the quadratic variation per unit time is calculated by averaging across many points in the path of X_(t)X_{t}. Then, the statistics in the third column of the table are calculated using the average errors (or average percent errors) from the 317 simulations.
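A one-dimensional sketch of the quadratic-variation idea behind the volatility column: for dX = c(m - X)dt + nu*sqrt(X) dW, the quadratic variation per unit time is nu^2 * X_t, so nu^2 can be recovered by dividing the summed squared increments by the time-integral of X. The parameter values below are arbitrary test inputs, and the plain Euler step is a simplification of the scheme in [4].

```python
import numpy as np

# Recover the CIR volatility parameter nu from the realized quadratic
# variation: sum (Delta X)^2 ~ nu^2 * integral of X dt.
rng = np.random.default_rng(4)
c, m, nu = 1.5, 1.0, 0.3
dt, n_steps = 1e-2, 200_000
dWs = np.sqrt(dt) * rng.standard_normal(n_steps)
x = m
sq_incr, x_time = 0.0, 0.0
for dW in dWs:
    dx = c * (m - x) * dt + nu * np.sqrt(max(x, 0.0)) * dW
    sq_incr += dx * dx
    x_time += x * dt
    x = max(x + dx, 1e-8)          # keep the Euler scheme nonnegative
nu_hat = np.sqrt(sq_incr / x_time)
print(nu_hat)                      # close to nu = 0.3
```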
7. American Options
High-dimensional American options are extremely computationally challenging to solve with traditional numerical methods such as finite difference. Here we propose a new approach using statistical learning to solve high-dimensional American options. SGDCT achieves a high accuracy on two benchmark problems with 100 dimensions.
7.1. Q-learning
Before describing the SGDCT algorithm for American options, it is important to note that traditional stochastic gradient descent faces certain difficulties in this class of problems. Some brief remarks are provided below regarding this fact; the authors plan to elaborate on these issues in more detail in a future work. The well-known Q-learning algorithm uses stochastic gradient descent to minimize an approximation to the discrete-time Hamilton-Jacobi-Bellman equation. To demonstrate the challenges and the issues that arise, consider using Q-learning to estimate the value function:
V(x)=E[int_(0)^(oo)e^(-gamma t)r(X_(t))dt∣X_(0)=x],quadX_(t)=x+W_(t),V(x)=\mathbb{E}\left[\int_{0}^{\infty} e^{-\gamma t} r\left(X_{t}\right) d t \mid X_{0}=x\right], \quad X_{t}=x+W_{t},
where gamma > 0\gamma>0 is a discount factor and r(x)r(x) is a reward function. The function Q(x,theta)Q(x, \theta) is an approximation for the value function V(x)V(x). The parameter theta\theta must be estimated. The traditional approach would discretize the dynamics (6.1) and then apply a stochastic gradient descent update to the objective function:
Note that we have scaled the learning rate in (6.3) by (1)/(Delta)\frac{1}{\Delta}. This is the correct scaling for taking the limit Delta rarr0\Delta \rightarrow 0. The algorithm (6.3) has a major computational issue. If the process X_(t)X_{t} is high-dimensional, E[Q(X_(t+Delta);theta_(t))∣X_(t)]\mathbb{E}\left[Q\left(X_{t+\Delta} ; \theta_{t}\right) \mid X_{t}\right] is computationally challenging to calculate, and this calculation must be repeated for a large number of samples (millions to hundreds of millions). It is also important to note that for the American option example that follows, the underlying dynamics are known. However, in reinforcement learning applications the transition probability is unknown, in which case E[Q(X_(t+Delta);theta_(t))∣X_(t)]\mathbb{E}\left[Q\left(X_{t+\Delta} ; \theta_{t}\right) \mid X_{t}\right] cannot be calculated. To circumvent these obstacles, the Q-learning algorithm ignores the inner expectation in (6.2), leading to the algorithm:
Although now computationally efficient, the Q-learning algorithm (6.4) is biased (due to ignoring the inner expectations). Furthermore, when Delta rarr0\Delta \rightarrow 0, the Q-learning algorithm (6.4) blows up. A quick investigation shows that the term (1)/(Delta)(W_(t+Delta)-W_(t))^(2)=O(1)\frac{1}{\Delta}\left(W_{t+\Delta}-W_{t}\right)^{2}=O(1) arises while all other terms are O(Delta)O(\Delta) or O(sqrtDelta)O(\sqrt{\Delta}).
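This O(1) term is easy to see numerically: the sample mean of (W_{t+Delta} - W_t)^2 / Delta stays near 1 at every step size, while the remaining terms in the update vanish with Delta.

```python
import numpy as np

# (W_{t+Delta} - W_t)^2 / Delta has mean 1 for every Delta, which is why the
# naive Q-learning update has no continuous-time limit.
rng = np.random.default_rng(5)
ratios = {}
for delta in [1e-1, 1e-2, 1e-3, 1e-4]:
    incr = np.sqrt(delta) * rng.standard_normal(100_000)   # Brownian increments
    ratios[delta] = np.mean(incr**2) / delta
print(ratios)   # every entry is close to 1, independent of delta
```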
The SGDCT algorithm is unbiased and computationally efficient. It can be directly derived by letting Delta rarr0\Delta \rightarrow 0 and using Itô's formula in (6.3):
dtheta_(t)=-alpha_(t)((1)/(2)Q_(theta xx)(X_(t);theta_(t))-gammaQ_(theta)(X_(t);theta_(t)))(r(X_(t))+(1)/(2)Q_(xx)(X_(t);theta_(t))-gamma Q(X_(t);theta_(t)))dt.d \theta_{t}=-\alpha_{t}\left(\frac{1}{2} Q_{\theta x x}\left(X_{t} ; \theta_{t}\right)-\gamma Q_{\theta}\left(X_{t} ; \theta_{t}\right)\right)\left(r\left(X_{t}\right)+\frac{1}{2} Q_{x x}\left(X_{t} ; \theta_{t}\right)-\gamma Q\left(X_{t} ; \theta_{t}\right)\right) d t .
Note that computationally challenging terms in (6.3) become differential operators in (6.7), which are usually easier to evaluate. This is one of the advantages of developing the theory in continuous time for continuous-time models. Once the continuous-time algorithm is derived, it can be appropriately discretized for numerical solution.
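As a concrete illustration of the continuous-time update, consider the toy problem X_t = x + W_t, r(x) = -x^2, gamma = 1, with the quadratic model Q(x; theta) = theta_1 x^2 + theta_2. Setting r + (1/2)Q_xx - gamma*Q = 0 gives the exact solution theta* = (-1/gamma, -1/gamma^2) = (-1, -1). To keep the demo ergodic we draw x i.i.d. from a standard normal rather than simulating the (transient) Brownian path, an assumption made purely for this sketch.

```python
import numpy as np

# SGDCT-style update for the value-function model Q(x;theta)=theta1*x^2+theta2:
#   dtheta = -alpha * (0.5*Q_theta_xx - gamma*Q_theta) * (r + 0.5*Q_xx - gamma*Q)
# with r(x) = -x^2 and gamma = 1.
rng = np.random.default_rng(6)
gamma = 1.0
theta1, theta2 = 0.0, 0.0
for n in range(200_000):
    x = rng.standard_normal()
    # residual r + 0.5*Q_xx - gamma*Q, using Q_xx = 2*theta1
    resid = -x**2 + theta1 - gamma * (theta1 * x**2 + theta2)
    a = 1.0 / (50.0 + n)                         # Robbins-Monro learning rate
    theta1 -= a * (1.0 - gamma * x**2) * resid   # 0.5*Q_theta1_xx - gamma*Q_theta1
    theta2 -= a * (-gamma) * resid               # 0.5*Q_theta2_xx - gamma*Q_theta2
print(theta1, theta2)    # both approach the exact solution (-1, -1)
```

The residual vanishes identically at theta*, so the update noise dies out near the solution and the iterates settle at the exact answer.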
7.2. SGDCT for American Options
Let X_(t)inR^(d)X_{t} \in \mathbb{R}^{d} be the prices of dd stocks. The maturity date is time TT and the payoff function is g(x):R^(d)rarrRg(x): \mathbb{R}^{d} \rightarrow \mathbb{R}. The stock dynamics and value function are:
{:[dX_(t)^(i)=mu(X_(t)^(i))dt+sigma(X_(t)^(i))dW_(t)^(i)","],[V(t","x)=s u p_(tau >= t)E[e^(-r(tau^^T))g(X_(tau^^T))∣X_(t)=x]","]:}\begin{aligned}
d X_{t}^{i} & =\mu\left(X_{t}^{i}\right) d t+\sigma\left(X_{t}^{i}\right) d W_{t}^{i}, \\
V(t, x) & =\sup _{\tau \geq t} \mathbb{E}\left[e^{-r(\tau \wedge T)} g\left(X_{\tau \wedge T}\right) \mid X_{t}=x\right],
\end{aligned}
where W_(t)inR^(d)W_{t} \in \mathbb{R}^{d} is a Brownian motion. The distribution of W_(t)W_{t} is specified by Var[W_(t)^(i)]=t\operatorname{Var}\left[W_{t}^{i}\right]=t and Corr[W_(t)^(i),W_(t)^(j)]=rho_(i,j)\operatorname{Corr}\left[W_{t}^{i}, W_{t}^{j}\right]=\rho_{i, j} for i!=ji \neq j. The SGDCT algorithm for an American option is given by (6.7), where L_(x)\mathcal{L}_{x} is the infinitesimal generator for the XX process. The continuous-time algorithm (6.7) is run for many iterations n=0,1,2,dotsn=0,1,2, \ldots until convergence. See the authors' paper [28] for implementation details on pricing American options with deep learning.
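The correlated Brownian increments specified above can be constructed from a Cholesky factor of the correlation matrix; the sketch below uses d and rho from the Table 8 setup and checks the sample covariance per unit time.

```python
import numpy as np

# Build increments with Var[W^i_t] = t and Corr[W^i_t, W^j_t] = rho (i != j)
# by multiplying i.i.d. normals with a Cholesky factor of the correlation matrix.
rng = np.random.default_rng(7)
d, rho, dt = 100, 0.75, 0.01
C = np.full((d, d), rho)
np.fill_diagonal(C, 1.0)
L = np.linalg.cholesky(C)

n = 20_000
z = rng.standard_normal((n, d))
dW = np.sqrt(dt) * z @ L.T            # increments with the desired covariance

cov = dW.T @ dW / (n * dt)            # sample covariance per unit time
print(cov[0, 0], cov[0, 1])           # approximately 1 and 0.75
```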
We implement the SGDCT algorithm (6.7) using a deep neural network for the function Q(t,x;theta)Q(t, x ; \theta). Two benchmark problems are considered where semi-analytic solutions are available. The SGDCT algorithm's accuracy is evaluated for American options in d=100d=100 dimensions, and the results are presented in Table 8.
| Model | Number of dimensions | Payoff function | Accuracy |
| :---: | :---: | :---: | :---: |
| Bachelier | 100 | $g(x)=\max \left(\frac{1}{d} \sum_{i=1}^{d} x_{i}-K, 0\right)$ | $0.1 \%$ |
| Black-Scholes | 100 | $g(x)=\max \left(\left(\prod_{i=1}^{d} x_{i}\right)^{1 / d}-K, 0\right)$ | $0.2 \%$ |
Table 8: For the Bachelier model, mu(x)=r-c\mu(x)=r-c and sigma(x)=sigma\sigma(x)=\sigma. For Black-Scholes, mu(x)=(r-c)x\mu(x)=(r-c) x and sigma(x)=sigma x\sigma(x)=\sigma x. All stocks are identical with correlation rho_(i,j)=.75\rho_{i, j}=.75, volatility sigma=.25\sigma=.25, initial stock price X_(0)=1X_{0}=1, dividend rate c=0.02c=0.02, and interest rate r=0r=0. The maturity of the option is T=2T=2 and the strike price is K=1K=1. The accuracy is reported for the price of the at-the-money American call option.
8. Appendix A: On a related Poisson equation
We recall the following regularity result from [24] on Poisson equations in the whole space, appropriately stated to cover our case of interest.
Theorem A.1. Let Conditions 2.2 and 2.3 be satisfied, and assume that G(x,theta)inC^(alpha,2)(X,R^(n))G(x, \theta) \in C^{\alpha, 2}\left(\mathcal{X}, \mathbb{R}^{n}\right). Then the associated Poisson equation has a unique solution that satisfies u(x,*)inC^(2)u(x, \cdot) \in C^{2} for every x inXx \in \mathcal{X}, del_(theta)^(2)u in C(XxxR^(n))\partial_{\theta}^{2} u \in C\left(\mathcal{X} \times \mathbb{R}^{n}\right), and there exist positive constants K^(')K^{\prime} and q^(')q^{\prime} such that
sum_(i=0)^(2)|(del^(i)u)/(deltheta^(i))(x,theta)|+|(del^(2)u)/(del x del theta)(x,theta)| <= K^(')(1+|x|^(q^('))).\sum_{i=0}^{2}\left|\frac{\partial^{i} u}{\partial \theta^{i}}(x, \theta)\right|+\left|\frac{\partial^{2} u}{\partial x \partial \theta}(x, \theta)\right| \leq K^{\prime}\left(1+|x|^{q^{\prime}}\right) .
9. References
[1] Y. Ait-Sahalia, Maximum Likelihood Estimation of Discretely Sampled Diffusions: A Closed-form Approximation Approach, Econometrica, Vol. 70, No. 1, 2002, pp. 223-262.
[2] Y. Ait-Sahalia, Closed-form likelihood expansions for multivariate diffusions, Annals of Statistics, Vol. 36, No. 2, 2008, pp. 906-937.
[3] A. Alabert and I. Gyongy, On numerical approximation of stochastic Burger's equation, in: From Stochastic Calculus to Mathematical Finance, Springer Berlin Heidelberg, 2006, pp. 1-15.
[4] A. Alfonsi, High order discretization schemes for the CIR process: application to affine term structure and Heston models, Mathematics of Computation, Vol. 79, No. 269, 2010, pp. 209-237.
[5] R. Ahlip and M. Rutkowski, Pricing of foreign exchange options under the Heston stochastic volatility model and CIR interest rates, Quantitative Finance, Vol. 13, No. 6, 2013, pp. 955-966.
[6] A. Barto, R. Sutton, and C. Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 13, No. 5, 1983, pp. 834-846.
[7] I. Basawa and B. Rao, Asymptotic inference for stochastic processes, Stochastic Processes and their Applications, Vol.10, No. 3, 1980, pp. 221-254.
[8] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, 2012.
[9] D. P. Bertsekas and J. N. Tsitsiklis, Gradient convergence in gradient methods with errors, SIAM Journal on Optimization, Vol. 10, No. 3, 2000, pp. 627-642.
[10] J. P. N. Bishwal, Parameter estimation in stochastic differential equations, in: Lecture Notes in Mathematics, Vol. 1923, Springer Science & Business Media, 2008.
[11] S. Brown and P. Dybvig, The empirical implications of the Cox, Ingersoll, Ross theory of the term structure of interest rates, The Journal of Finance, Vol.41, No. 3, 1986, pp. 617-630.
[12] Q. Dai and K. Singleton, Specification analysis of affine term structure models, The Journal of Finance, Vol. 55, No. 5, 2000, pp. 1943-1978.
[13] A. Davie and J. Gaines, Convergence of numerical schemes for the solution of parabolic stochastic partial differential equations, Mathematics of Computation, Vol. 70, No. 233, (2001), pp. 121-134.
[14] O. Elerian, S. Chib, and N. Shephard, Likelihood inference for discretely observed nonlinear diffusions, Econometrica, Vol. 69, No. 4, (2001), pp. 959-993.
[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Book in preparation for MIT Press, 2016.
[16] H. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, Second Edition. Springer, 2003.
[17] Y. Kutoyants, Statistical inference for ergodic diffusion processes, Springer Science & Business Media, 2004.
[18] V. Linetsky, Computing hitting time densities for CIR and OU diffusions: applications to mean-reverting models, Journal of Computational Finance, Vol. 7, 2004, pp. 1-22.
[19] T. Leung and X. Li, Optimal mean reversion trading with transaction costs and stop-loss exit, International Journal of Theoretical and Applied Finance, Vol. 18, No. 3, 2015, p. 155020.
[20] T. Leung, J. Li, X. Li, and Z. Wang, Speculative futures trading under mean reversion, Asia-Pacific Financial Markets, Vol. 23, No. 4, 2015, pp. 281-304.
[21] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, Vol. 19, No. 4, 2009, pp. 1574-1609.
[22] Y. Maghsoodi, Solution of the extended CIR term structure and bond option valuation, Mathematical Finance, Vol. 6, No. 1, 1996, pp. 89-109.
[23] E. Pardoux and A.Yu. Veretennikov, On Poisson equation and diffusion approximation 1, Annals of Probability, Vol. 29, No. 3, 2001, pp. 1061-1085.
[24] E. Pardoux and A. Y. Veretennikov, On Poisson equation and diffusion approximation 2, The Annals of Probability, Vol. 31, No. 3, 2003, pp. 1166-1192.
[25] M. Raginsky and J. Bouvrie, Continuous-time stochastic mirror descent on a network: variance reduction, consensus, convergence, IEEE Conference on Decision and Control, 2012.
[26] B. L. S. P. Rao, Statistical inference for diffusion type processes, Arnold, 1999.
[27] G. O. Roberts and R.L. Tweedie, Exponential convergence of Langevin distributions and their discrete approximations, Bernoulli, 1996, pp. 341-363.
[28] J. Sirignano and K. Spiliopoulos, DGM: A deep learning algorithm for solving partial differential equations, 2017, arXiv: 1708.07469.
[29] O. Vasicek, An equilibrium characterization of the term structure, Journal of Financial Economics, Vol. 5, No. 2, 1977, pp. 177-188.
*Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, Urbana, Email: jasirign@illinois.edu
^(†){ }^{\dagger} Department of Mathematics and Statistics, Boston University, Boston, E-mail: kspiliop@math.bu.edu
^(‡){ }^{\ddagger} Research of K.S. supported in part by the National Science Foundation (DMS 1550918)
${ }^{\S}$ Computations for this paper were supported by a Blue Waters supercomputer grant.
${ }^{1}$ Let $r_{e, t}$ be the reward for episode $e$ at time $t$. Let $R_{t, e}=\sum_{t^{\prime}=t+1}^{T_{\text{end of episode}}} \gamma^{t^{\prime}-t} r_{e, t^{\prime}}$ be the cumulative discounted reward from episode $e$ after time $t$, where $\gamma \in[0,1]$ is the discount factor. Stochastic gradient descent is used to learn the parameter $\theta^{\mu}$: $\theta^{\mu} \leftarrow \theta^{\mu}+\eta_{e} R_{t, e} \frac{\partial}{\partial \theta^{\mu}} \log \mu\left(s_{t}, a_{t}, \theta^{\mu}\right)$, where $\eta_{e}$ is the learning rate. In practice, the cumulative discounted rewards are often normalized across an episode.